# House Price Prediction with Amazon SageMaker FeatureStore



Kernel `Python 3 (Data Science)` works well with this notebook.

The following policies need to be attached to the execution role:
- AmazonSageMakerFullAccess
- AmazonS3FullAccess

## Contents
1. [Background](#Background)
1. [Setup SageMaker FeatureStore](#Setup-SageMaker-FeatureStore)
1. [Inspect Dataset](#Inspect-Dataset)
1. [Ingest Data into FeatureStore](#Ingest-Data-into-FeatureStore)
1. [Build Training Dataset](#Build-Training-Dataset)
1. [Train and Deploy the Model](#Train-and-Deploy-the-Model)
1. [SageMaker FeatureStore At Inference](#SageMaker-FeatureStore-During-Inference)
1. [Cleanup Resources](#Cleanup-Resources)

## Background

Amazon SageMaker FeatureStore is a new SageMaker capability that makes it easy for customers to create and manage curated data for machine learning (ML) development. SageMaker FeatureStore enables data ingestion via a high TPS API and data consumption via the online and offline stores. 

This notebook provides an example for the APIs provided by SageMaker FeatureStore by walking through the process of training a House Price prediction model. The notebook demonstrates how the dataset's tables can be ingested into the FeatureStore, queried to create a training dataset, and quickly accessed during inference. 


### Terminology

A **FeatureGroup** is the main resource that contains the metadata for all the data stored in SageMaker FeatureStore. A FeatureGroup contains a list of FeatureDefinitions. 

A **FeatureDefinition** consists of a name and one of the following feature types: Integral, String, or Fractional.

The FeatureGroup also contains an **OnlineStoreConfig** and an **OfflineStoreConfig** controlling where the data is stored. Enabling the online store allows quick access to the latest value for a Record via the GetRecord API. The offline store, a required configuration, allows storage of historical data in your S3 bucket. 

Once a FeatureGroup is created, data can be added as Records. 

A **Record** can be thought of as a row in a table. Each Record has a unique **RecordIdentifier** value along with values for the other FeatureDefinitions in the FeatureGroup.
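To make the online/offline distinction concrete, here is a minimal in-memory sketch. `ToyFeatureGroup` and its methods are hypothetical names invented for this illustration; they are not the service's implementation. In particular, the rule that a write with an equal or newer event time replaces the online copy is an assumption of the toy.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToyFeatureGroup:
    """Toy in-memory stand-in for a FeatureGroup (illustration only)."""

    record_identifier_name: str
    event_time_name: str
    online: Dict[str, dict] = field(default_factory=dict)  # latest record per identifier
    offline: List[dict] = field(default_factory=list)      # append-only history

    def put_record(self, record: dict) -> None:
        key = record[self.record_identifier_name]
        # The offline history keeps every write.
        self.offline.append(dict(record))
        current = self.online.get(key)
        # Assumption for this toy: a write replaces the online copy unless it is older.
        if current is None or record[self.event_time_name] >= current[self.event_time_name]:
            self.online[key] = dict(record)

    def get_record(self, identifier: str) -> dict:
        return self.online[identifier]


group = ToyFeatureGroup("neighborhood", "event_time")
group.put_record({"neighborhood": "Rockridge", "total_households": 400, "event_time": 1.0})
group.put_record({"neighborhood": "Rockridge", "total_households": 425, "event_time": 2.0})

print(group.get_record("Rockridge")["total_households"])  # latest value only
print(len(group.offline))                                 # both writes are kept
```

The lookup by identifier returns a single record, while the history list grows with every write.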

## Setup SageMaker FeatureStore

Let's start by setting up the SageMaker Python SDK and the boto3 clients. Note that this notebook requires a `boto3` version above `1.17.21`.
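If you want to check the installed version programmatically, a small comparison helper is one option. This is a sketch that uses only the standard library. `version_tuple` and `is_newer_than` are hypothetical helper names, and the parsing assumes plain `major.minor.patch` version strings.

```python
import re


def version_tuple(version: str) -> tuple:
    """Turn '1.17.21' into (1, 17, 21); only the leading digits of each part are used."""
    parts = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    return tuple(parts)


def is_newer_than(installed: str, minimum_exclusive: str = "1.17.21") -> bool:
    """True when `installed` is strictly above `minimum_exclusive`."""
    return version_tuple(installed) > version_tuple(minimum_exclusive)


print(is_newer_than("1.17.21"))
print(is_newer_than("1.34.0"))
```

In the notebook you could pass `boto3.__version__` as the `installed` argument.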

In [2]:
import boto3
import sagemaker

original_boto3_version = boto3.__version__
%pip install 'boto3>1.17.21'

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml
Note: you may need to restart the kernel to use updated packages.


In [3]:
from sagemaker.session import Session

region = boto3.Session().region_name

boto_session = boto3.Session(region_name=region)

# create a sagemaker client
sagemaker_client = boto_session.client(service_name="sagemaker", region_name=region)

# create a feature store runtime 
featurestore_runtime = boto_session.client(
    service_name="sagemaker-featurestore-runtime", region_name=region
)

# create a feature store session 
feature_store_session = Session(
    boto_session=boto_session,
    sagemaker_client=sagemaker_client,
    sagemaker_featurestore_runtime_client=featurestore_runtime,
)

#### S3 Bucket Setup For The OfflineStore

SageMaker FeatureStore writes the data in the OfflineStore of a FeatureGroup to an S3 bucket owned by you. To be able to write to your S3 bucket, SageMaker FeatureStore assumes an IAM role that has access to it. The role is also owned by you.
Note that the same bucket can be reused across FeatureGroups. Data in the bucket is partitioned by FeatureGroup.

Set the default S3 bucket name here; it is referenced throughout the notebook.

In [4]:
# You can modify the following to use a bucket of your choosing
default_s3_bucket_name = feature_store_session.default_bucket()
prefix = "sagemaker-featurestore-demo"

print(default_s3_bucket_name)

sagemaker-us-east-1-904981812149


In [5]:
from sagemaker import get_execution_role

# You can modify the following to use a role of your choosing. See the documentation for how to create this.
role = get_execution_role()
print(role)

arn:aws:iam::904981812149:role/LabRole


## Inspect Dataset

In [6]:
import pandas as pd
housing_data = pd.read_csv("data/housing.csv")
google_maps_data = pd.read_csv("data/housing_gmaps_data_raw.csv")

In [7]:
# view the top 5 records of housing data
housing_data.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,NEAR BAY
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,NEAR BAY
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,NEAR BAY


In [8]:
# view the top 5 records of google maps data
google_maps_data.head()

Unnamed: 0,street_number,route,locality-political,administrative_area_level_2-political,administrative_area_level_1-political,country-political,postal_code,address,longitude,latitude,...,establishment-natural_feature,airport-establishment-point_of_interest,political-sublocality-sublocality_level_1,administrative_area_level_3-political,post_box,establishment-light_rail_station-point_of_interest-transit_station,establishment-point_of_interest,aquarium-establishment-park-point_of_interest-tourist_attraction-zoo,campground-establishment-lodging-park-point_of_interest-rv_park-tourist_attraction,cemetery-establishment-park-point_of_interest
0,3130,Grizzly Peak Boulevard,Berkeley,Alameda County,California,United States,94705.0,"3130 Grizzly Peak Blvd, Berkeley, CA 94705, USA",-122.23,37.88,...,,,,,,,,,,
1,2005,Tunnel Road,Oakland,Alameda County,California,United States,94611.0,"2005 Tunnel Rd, Oakland, CA 94611, USA",-122.22,37.86,...,,,,,,,,,,
2,6886,Chabot Road,Oakland,Alameda County,California,United States,94618.0,"6886 Chabot Rd, Oakland, CA 94618, USA",-122.24,37.85,...,,,,,,,,,,
3,6365,Florio Street,Oakland,Alameda County,California,United States,94618.0,"6365 Florio St, Oakland, CA 94618, USA",-122.25,37.85,...,,,,,,,,,,
4,5407,Bryant Avenue,Oakland,Alameda County,California,United States,94618.0,"5407 Bryant Ave, Oakland, CA 94618, USA",-122.25,37.84,...,,,,,,,,,,


In [9]:
# print out all the columns of housing data
housing_data.columns

Index(['longitude', 'latitude', 'housing_median_age', 'total_rooms',
       'total_bedrooms', 'population', 'households', 'median_income',
       'median_house_value', 'ocean_proximity'],
      dtype='object')

In [10]:
# print out all the columns of google maps data
google_maps_data.columns

Index(['street_number', 'route', 'locality-political',
       'administrative_area_level_2-political',
       'administrative_area_level_1-political', 'country-political',
       'postal_code', 'address', 'longitude', 'latitude',
       'neighborhood-political', 'postal_code_suffix',
       'establishment-point_of_interest-transit_station',
       'establishment-park-point_of_interest', 'premise',
       'establishment-point_of_interest-subway_station-transit_station',
       'airport-establishment-finance-moving_company-point_of_interest-storage',
       'subpremise',
       'bus_station-establishment-point_of_interest-transit_station',
       'establishment-park-point_of_interest-tourist_attraction',
       'establishment-natural_feature',
       'airport-establishment-point_of_interest',
       'political-sublocality-sublocality_level_1',
       'administrative_area_level_3-political', 'post_box',
       'establishment-light_rail_station-point_of_interest-transit_station',
       'e

In [11]:
merged_data = pd.merge(housing_data, google_maps_data, on=['latitude', 'longitude'], how='inner')
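The merge above joins the two tables on the coordinate pair. The sketch below shows the same kind of join on made-up toy data. `homes` and `places` are hypothetical stand-ins, and `validate="one_to_one"` is an optional extra check that is not used in this notebook. It raises an error when a key pair repeats on either side, which can happen with real coordinates.

```python
import pandas as pd

# Toy stand-ins for the two tables (values are made up for illustration).
homes = pd.DataFrame({
    "latitude": [37.88, 37.86, 37.85],
    "longitude": [-122.23, -122.22, -122.24],
    "total_rooms": [880, 7099, 1467],
})
places = pd.DataFrame({
    "latitude": [37.88, 37.86],
    "longitude": [-122.23, -122.22],
    "locality": ["Berkeley", "Oakland"],
})

# how="inner" keeps only coordinate pairs present in both tables;
# validate="one_to_one" raises MergeError if either side repeats a key pair.
merged = pd.merge(
    homes, places, on=["latitude", "longitude"], how="inner", validate="one_to_one"
)
print(merged)
```

The third home has no matching row in `places`, so the inner join drops it.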


In [12]:
# print the top 5 records of merged table
merged_data.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity,...,establishment-natural_feature,airport-establishment-point_of_interest,political-sublocality-sublocality_level_1,administrative_area_level_3-political,post_box,establishment-light_rail_station-point_of_interest-transit_station,establishment-point_of_interest,aquarium-establishment-park-point_of_interest-tourist_attraction-zoo,campground-establishment-lodging-park-point_of_interest-rv_park-tourist_attraction,cemetery-establishment-park-point_of_interest
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY,...,,,,,,,,,,
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY,...,,,,,,,,,,
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,NEAR BAY,...,,,,,,,,,,
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,NEAR BAY,...,,,,,,,,,,
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,NEAR BAY,...,,,,,,,,,,


In [13]:
# print out the statistical information of merged_data table
merged_data.describe()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,postal_code,postal_code_suffix
count,20640.0,20640.0,20640.0,20640.0,20433.0,20640.0,20640.0,20640.0,20640.0,20454.0,14095.0
mean,-119.569704,35.631861,28.639486,2635.763081,537.870553,1425.476744,499.53968,3.870671,206855.816909,92996.901926,3840.22547
std,2.003532,2.135952,12.585558,2181.615252,421.38507,1132.462122,382.329753,1.899822,115395.615874,1858.067396,2174.426227
min,-124.35,32.54,1.0,2.0,1.0,3.0,1.0,0.4999,14999.0,85344.0,110.0
25%,-121.8,33.93,18.0,1447.75,296.0,787.0,280.0,2.5634,119600.0,91505.25,2214.0
50%,-118.49,34.26,29.0,2127.0,435.0,1166.0,409.0,3.5348,179700.0,92840.0,3340.0
75%,-118.01,37.71,37.0,3148.0,647.0,1725.0,605.0,4.74325,264725.0,94601.0,4931.0
max,-114.31,41.95,52.0,39320.0,6445.0,35682.0,6082.0,15.0001,500001.0,96161.0,9859.0


In [14]:
# Displays summary information about the merged_data
merged_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 38 columns):
 #   Column                                                                              Non-Null Count  Dtype  
---  ------                                                                              --------------  -----  
 0   longitude                                                                           20640 non-null  float64
 1   latitude                                                                            20640 non-null  float64
 2   housing_median_age                                                                  20640 non-null  float64
 3   total_rooms                                                                         20640 non-null  float64
 4   total_bedrooms                                                                      20433 non-null  float64
 5   population                                                                          20640 non-n

In [15]:
merged_data.columns

Index(['longitude', 'latitude', 'housing_median_age', 'total_rooms',
       'total_bedrooms', 'population', 'households', 'median_income',
       'median_house_value', 'ocean_proximity', 'street_number', 'route',
       'locality-political', 'administrative_area_level_2-political',
       'administrative_area_level_1-political', 'country-political',
       'postal_code', 'address', 'neighborhood-political',
       'postal_code_suffix', 'establishment-point_of_interest-transit_station',
       'establishment-park-point_of_interest', 'premise',
       'establishment-point_of_interest-subway_station-transit_station',
       'airport-establishment-finance-moving_company-point_of_interest-storage',
       'subpremise',
       'bus_station-establishment-point_of_interest-transit_station',
       'establishment-park-point_of_interest-tourist_attraction',
       'establishment-natural_feature',
       'airport-establishment-point_of_interest',
       'political-sublocality-sublocality_level_1'

In [16]:
merged_data['ocean_proximity'].value_counts()

ocean_proximity
<1H OCEAN     9136
INLAND        6551
NEAR OCEAN    2658
NEAR BAY      2290
ISLAND           5
Name: count, dtype: int64

In [17]:
import pandas as pd
import numpy as np

# Clean the ocean_proximity data by trimming whitespace and converting to lowercase
#merged_data['ocean_proximity'] = merged_data['ocean_proximity'].str.strip().str.lower()

# Applying pd.get_dummies to perform one-hot encoding on 'ocean_proximity'
ocean_proximity_encoded = pd.get_dummies(merged_data['ocean_proximity'], dtype = "int")


ocean_proximity_encoded.rename(columns={
    '<1H OCEAN': 'less_1_ocean',
    'INLAND': 'inland',
    'ISLAND': 'island',
    'NEAR BAY': 'near_bay',
    'NEAR OCEAN': 'near_ocean'
}, inplace=True)


# Concatenate the new one-hot encoded DataFrame with the original DataFrame
merged_data = pd.concat([merged_data, ocean_proximity_encoded], axis=1)


# Drop the original 'ocean_proximity' column to avoid redundancy
merged_data.drop('ocean_proximity', axis=1, inplace=True)


# Fill NaN values before grouping
merged_data['households'] = merged_data['households'].fillna(merged_data['households'].mean())
merged_data['total_bedrooms'] = merged_data['total_bedrooms'].fillna(merged_data['total_bedrooms'].mean())

# Group by neighborhood-political to calculate aggregates
grouped = merged_data.groupby('neighborhood-political')

# Apply group transforms and handle NaN values immediately after
transformed_households = grouped['households'].transform('mean')
transformed_households.fillna(transformed_households.mean(), inplace=True)

transformed_median_house_value = grouped['median_house_value'].transform('mean')
transformed_median_house_value.fillna(transformed_median_house_value.mean(), inplace=True)

# Use the neighborhood mean of median_house_value, capped at 500,000
merged_data['median_house_value'] = transformed_median_house_value.clip(upper=500000)

# Calculate and discretize median house age
transformed_median_house_age = grouped['housing_median_age'].transform('mean')
transformed_median_house_age.fillna(transformed_median_house_age.mean(), inplace=True)
merged_data['median_house_age_group'] = (transformed_median_house_age // 10) * 10

# Mean households per neighborhood, rounded up to an integer (stored as total_households)
merged_data['total_households'] = np.ceil(transformed_households).astype(int)

# Calculate bedrooms per household and impute missing values with the neighborhood mean
merged_data['bedrooms_per_household'] = merged_data['total_bedrooms'] / merged_data['households']
# Group again so the newly added column is available to the groupby
avg_bedrooms_per_locality = merged_data.groupby('neighborhood-political')['bedrooms_per_household'].transform('mean')
avg_bedrooms_per_locality = avg_bedrooms_per_locality.fillna(avg_bedrooms_per_locality.mean())
merged_data['bedrooms_per_household'] = merged_data['bedrooms_per_household'].fillna(avg_bedrooms_per_locality)

# Encode locality-political
merged_data['locality_code'] = merged_data['locality-political'].astype('category').cat.codes


# Select relevant columns for the feature group
neighborhood_features = merged_data[['neighborhood-political', 'less_1_ocean', 'inland', 
                                     'island', 'near_bay', 'near_ocean', 'median_house_value', 
                                     'median_house_age_group', 'total_households', 'bedrooms_per_household', 'locality_code']]

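# Drop rows where 'neighborhood-political' is null to clean up the data.
# .copy() makes the result an independent DataFrame, so adding columns later does not raise SettingWithCopyWarning.
neighborhood_features_cleaned = neighborhood_features.dropna(subset=['neighborhood-political']).copy()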

# Display the cleaned DataFrame
print(neighborhood_features_cleaned.head())




  neighborhood-political  less_1_ocean  inland  island  near_bay  near_ocean  \
1             Merriewood             0       0       0         1           0   
2        Upper Rockridge             0       0       0         1           0   
3              Rockridge             0       0       0         1           0   
4              Rockridge             0       0       0         1           0   
5              Rockridge             0       0       0         1           0   

   median_house_value  median_house_age_group  total_households  \
1       328500.000000                    30.0               797   
2       377557.285714                    40.0               358   
3       292483.333333                    50.0               425   
4       292483.333333                    50.0               425   
5       292483.333333                    50.0               425   

   bedrooms_per_household  locality_code  
1                0.971880            625  
2                1.073446     
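The feature-engineering cell above combines three pandas techniques: one-hot encoding, group-wise `transform`, and bucketing by floor division. The following sketch shows them on a tiny made-up table. The `toy` DataFrame and its column names are hypothetical, and the behavior for the missing group key assumes a recent pandas release (2.x, as this notebook's output suggests).

```python
import pandas as pd

toy = pd.DataFrame({
    "neighborhood": ["A", "A", "B", None],
    "ocean_proximity": ["NEAR BAY", "NEAR BAY", "INLAND", "INLAND"],
    "housing_median_age": [52.0, 41.0, 18.0, 30.0],
})

# One-hot encode the categorical column into 0/1 integer columns.
dummies = pd.get_dummies(toy["ocean_proximity"], dtype="int")
toy = pd.concat([toy.drop(columns="ocean_proximity"), dummies], axis=1)

# transform("mean") returns one value per original row: the mean of that row's group.
# Rows whose group key is missing get NaN, which is then filled with the overall mean.
group_mean = toy.groupby("neighborhood")["housing_median_age"].transform("mean")
group_mean = group_mean.fillna(group_mean.mean())

# Floor to the decade, e.g. 46.5 -> 40.0.
toy["age_group"] = (group_mean // 10) * 10
print(toy)
```

Because `transform` keeps the original row count, every row of a neighborhood carries the same aggregate. That is why repeated neighborhoods in the output above show identical `median_house_value` and `total_households` values.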

In [18]:
neighborhood_features_cleaned['less_1_ocean'].value_counts()

less_1_ocean
1    4651
0    4349
Name: count, dtype: int64

In [19]:
neighborhood_features_cleaned['inland'].value_counts()

inland
0    7506
1    1494
Name: count, dtype: int64

In [20]:
neighborhood_features_cleaned['island'].value_counts()

island
0    9000
Name: count, dtype: int64

In [21]:
neighborhood_features_cleaned['near_bay'].value_counts()

near_bay
0    7590
1    1410
Name: count, dtype: int64

In [22]:
neighborhood_features_cleaned['near_ocean'].value_counts()

near_ocean
0    7555
1    1445
Name: count, dtype: int64

In [23]:
neighborhood_features_cleaned.columns

Index(['neighborhood-political', 'less_1_ocean', 'inland', 'island',
       'near_bay', 'near_ocean', 'median_house_value',
       'median_house_age_group', 'total_households', 'bedrooms_per_household',
       'locality_code'],
      dtype='object')

In [24]:
neighborhood_features_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
Index: 9000 entries, 1 to 20636
Data columns (total 11 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   neighborhood-political  9000 non-null   object 
 1   less_1_ocean            9000 non-null   int64  
 2   inland                  9000 non-null   int64  
 3   island                  9000 non-null   int64  
 4   near_bay                9000 non-null   int64  
 5   near_ocean              9000 non-null   int64  
 6   median_house_value      9000 non-null   float64
 7   median_house_age_group  9000 non-null   float64
 8   total_households        9000 non-null   int64  
 9   bedrooms_per_household  9000 non-null   float64
 10  locality_code           9000 non-null   int16  
dtypes: float64(3), int16(1), int64(6), object(1)
memory usage: 791.0+ KB


In [25]:
print(neighborhood_features_cleaned[neighborhood_features_cleaned['neighborhood-political']=='Brooktree'], neighborhood_features_cleaned[neighborhood_features_cleaned['neighborhood-political']== "Fisherman's Wharf"], neighborhood_features_cleaned[neighborhood_features_cleaned['neighborhood-political']=='Los Osos'])


      neighborhood-political  less_1_ocean  inland  island  near_bay  \
17825              Brooktree             1       0       0         0   

       near_ocean  median_house_value  median_house_age_group  \
17825           0            257400.0                     0.0   

       total_households  bedrooms_per_household  locality_code  
17825              1438                0.374041            787         neighborhood-political  less_1_ocean  inland  island  near_bay  \
15616      Fisherman's Wharf             0       0       0         1   

       near_ocean  median_house_value  median_house_age_group  \
15616           0            500000.0                    50.0   

       total_households  bedrooms_per_household  locality_code  
15616               250                   1.268            781         neighborhood-political  less_1_ocean  inland  island  near_bay  \
16628               Los Osos             0       0       0         0   
16629               Los Osos             0  

In [26]:

# Create boolean masks for each neighborhood
brooktree_mask = neighborhood_features_cleaned['neighborhood-political'] == 'Brooktree'
fishermans_wharf_mask = neighborhood_features_cleaned['neighborhood-political'] == "Fisherman's Wharf"
los_osos_mask = neighborhood_features_cleaned['neighborhood-political'] == 'Los Osos'

# Filter data for each neighborhood
brooktree_data = neighborhood_features_cleaned[brooktree_mask]
fishermans_wharf_data = neighborhood_features_cleaned[fishermans_wharf_mask]
los_osos_data = neighborhood_features_cleaned[los_osos_mask]

# Concatenate the results into a single DataFrame
result_df = pd.concat([brooktree_data, fishermans_wharf_data, los_osos_data])

# Display the concatenated DataFrame
result_df


Unnamed: 0,neighborhood-political,less_1_ocean,inland,island,near_bay,near_ocean,median_house_value,median_house_age_group,total_households,bedrooms_per_household,locality_code
17825,Brooktree,1,0,0,0,0,257400.0,0.0,1438,0.374041,787
15616,Fisherman's Wharf,0,0,0,1,0,500000.0,50.0,250,1.268,781
16628,Los Osos,0,0,0,0,1,221612.5,10.0,612,1.042384,55
16629,Los Osos,0,0,0,0,1,221612.5,10.0,612,1.031722,55
16630,Los Osos,0,0,0,0,1,221612.5,10.0,612,1.062121,55
16631,Los Osos,0,0,0,0,1,221612.5,10.0,612,1.030651,55
16633,Los Osos,0,0,0,0,1,221612.5,10.0,612,1.095628,55
16634,Los Osos,0,0,0,0,1,221612.5,10.0,612,0.990148,55
16635,Los Osos,0,0,0,0,1,221612.5,10.0,612,1.03481,55
16636,Los Osos,0,0,0,0,1,221612.5,10.0,612,1.095611,55
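The three masks and the `pd.concat` call can also be written as a single `isin` filter. This is a sketch on made-up rows, and `features` and `wanted` are hypothetical names. Note one difference: `isin` keeps the original row order instead of grouping the rows by neighborhood.

```python
import pandas as pd

features = pd.DataFrame({
    "neighborhood-political": [
        "Brooktree", "Los Osos", "Rockridge", "Fisherman's Wharf", "Los Osos",
    ],
    "total_households": [1438, 612, 425, 250, 612],
})

wanted = ["Brooktree", "Fisherman's Wharf", "Los Osos"]

# isin builds one boolean mask covering all requested names.
subset = features[features["neighborhood-political"].isin(wanted)]
print(subset)
```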


## Ingest Data into FeatureStore

In this step we create a FeatureGroup that represents the neighborhood-level features and then ingest the records into it.


### Define FeatureGroups

In [27]:
from time import gmtime, strftime, sleep

neighborhood_politics_feature_group_name = "neighborhood_politics_feature_group-" + strftime("%d-%H-%M-%S", gmtime())


In [28]:
from sagemaker.feature_store.feature_group import FeatureGroup

neighborhood_politics_feature_group = FeatureGroup(
    name=neighborhood_politics_feature_group_name, sagemaker_session=feature_store_session
)

In [29]:
import time

current_time_sec = int(round(time.time()))


def cast_object_to_string(data_frame):
    for label in data_frame.columns:
        if data_frame.dtypes[label] == "object":
            data_frame[label] = data_frame[label].astype("str").astype("string")


# cast object dtype to string. The SageMaker FeatureStore Python SDK will then map the string dtype to String feature type.
cast_object_to_string(neighborhood_features_cleaned)

# record identifier and event time feature names
neighborhood_identifier_feature_name = "neighborhood-political"
event_time_feature_name = "event_time"

# append EventTime feature
neighborhood_features_cleaned[event_time_feature_name] = pd.Series(
    [current_time_sec] * len(neighborhood_features_cleaned),
    index=neighborhood_features_cleaned.index,  # Ensure the series aligns with the DataFrame's index
    dtype="float64"
)


neighborhood_features_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
Index: 9000 entries, 1 to 20636
Data columns (total 12 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   neighborhood-political  9000 non-null   string 
 1   less_1_ocean            9000 non-null   int64  
 2   inland                  9000 non-null   int64  
 3   island                  9000 non-null   int64  
 4   near_bay                9000 non-null   int64  
 5   near_ocean              9000 non-null   int64  
 6   median_house_value      9000 non-null   float64
 7   median_house_age_group  9000 non-null   float64
 8   total_households        9000 non-null   int64  
 9   bedrooms_per_household  9000 non-null   float64
 10  locality_code           9000 non-null   int16  
 11  event_time              9000 non-null   float64
dtypes: float64(4), int16(1), int64(6), string(1)
memory usage: 861.3 KB



In [30]:
# load feature definitions to the feature group. SageMaker FeatureStore Python SDK will auto-detect the data schema based on input data.
neighborhood_politics_feature_group.load_feature_definitions(data_frame=neighborhood_features_cleaned)

[FeatureDefinition(feature_name='neighborhood-political', feature_type=<FeatureTypeEnum.STRING: 'String'>, collection_type=None),
 FeatureDefinition(feature_name='less_1_ocean', feature_type=<FeatureTypeEnum.INTEGRAL: 'Integral'>, collection_type=None),
 FeatureDefinition(feature_name='inland', feature_type=<FeatureTypeEnum.INTEGRAL: 'Integral'>, collection_type=None),
 FeatureDefinition(feature_name='island', feature_type=<FeatureTypeEnum.INTEGRAL: 'Integral'>, collection_type=None),
 FeatureDefinition(feature_name='near_bay', feature_type=<FeatureTypeEnum.INTEGRAL: 'Integral'>, collection_type=None),
 FeatureDefinition(feature_name='near_ocean', feature_type=<FeatureTypeEnum.INTEGRAL: 'Integral'>, collection_type=None),
 FeatureDefinition(feature_name='median_house_value', feature_type=<FeatureTypeEnum.FRACTIONAL: 'Fractional'>, collection_type=None),
 FeatureDefinition(feature_name='median_house_age_group', feature_type=<FeatureTypeEnum.FRACTIONAL: 'Fractional'>, collection_type=Non
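The mapping from pandas dtypes to feature types visible in the output above can be approximated with a few lines of plain Python. `infer_feature_type` is a hypothetical helper written for this illustration, not the SDK's code. Raising an error for other dtypes such as `object` is an assumption that matches why the notebook casts object columns to `string` first.

```python
# dtype names copied from the info() output of the DataFrame
column_dtypes = {
    "neighborhood-political": "string",
    "less_1_ocean": "int64",
    "median_house_value": "float64",
    "locality_code": "int16",
    "event_time": "float64",
}


def infer_feature_type(dtype_name: str) -> str:
    """Hypothetical helper: approximate the dtype -> feature type mapping."""
    if dtype_name.startswith(("int", "uint")):
        return "Integral"
    if dtype_name.startswith("float"):
        return "Fractional"
    if dtype_name == "string":
        return "String"
    # Assumption: anything else (for example "object") cannot be mapped.
    raise ValueError(f"Cannot infer a feature type for dtype {dtype_name!r}")


for column, dtype_name in column_dtypes.items():
    print(column, "->", infer_feature_type(dtype_name))
```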

### Create FeatureGroups in SageMaker FeatureStore

In [31]:
def wait_for_feature_group_creation_complete(feature_group):
    status = feature_group.describe().get("FeatureGroupStatus")
    while status == "Creating":
        print("Waiting for Feature Group Creation")
        time.sleep(5)
        status = feature_group.describe().get("FeatureGroupStatus")
    if status != "Created":
        raise RuntimeError(f"Failed to create feature group {feature_group.name}")
    print(f"FeatureGroup {feature_group.name} successfully created.")


neighborhood_politics_feature_group.create(
    s3_uri=f"s3://{default_s3_bucket_name}/{prefix}",
    record_identifier_name=neighborhood_identifier_feature_name,
    event_time_feature_name=event_time_feature_name,
    role_arn=role,
    enable_online_store=True,
)


wait_for_feature_group_creation_complete(feature_group=neighborhood_politics_feature_group)


Waiting for Feature Group Creation
Waiting for Feature Group Creation
Waiting for Feature Group Creation
FeatureGroup neighborhood_politics_feature_group-28-22-08-36 successfully created.
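The polling loop above waits for as long as the status stays `Creating`. A variant with an upper time limit might look like the sketch below. `wait_for_status` and its parameters are hypothetical names, and the status source here is a fake iterator so the example runs without AWS access.

```python
import time
from typing import Callable


def wait_for_status(
    get_status: Callable[[], str],
    pending: str = "Creating",
    success: str = "Created",
    poll_seconds: float = 5.0,
    timeout_seconds: float = 600.0,
) -> None:
    """Poll get_status() until it leaves `pending`; fail on timeout or any other final status."""
    deadline = time.monotonic() + timeout_seconds
    status = get_status()
    while status == pending:
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Status still {pending!r} after {timeout_seconds} seconds")
        time.sleep(poll_seconds)
        status = get_status()
    if status != success:
        raise RuntimeError(f"Unexpected final status: {status!r}")


# Fake status source standing in for feature_group.describe().get("FeatureGroupStatus").
statuses = iter(["Creating", "Creating", "Created"])
wait_for_status(lambda: next(statuses), poll_seconds=0.01, timeout_seconds=1.0)
print("created")
```

With a real FeatureGroup, the callable could be `lambda: feature_group.describe().get("FeatureGroupStatus")`.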


Confirm the FeatureGroup has been created by using the DescribeFeatureGroup and ListFeatureGroups APIs.

In [32]:
neighborhood_politics_feature_group.describe()

{'FeatureGroupArn': 'arn:aws:sagemaker:us-east-1:904981812149:feature-group/neighborhood_politics_feature_group-28-22-08-36',
 'FeatureGroupName': 'neighborhood_politics_feature_group-28-22-08-36',
 'RecordIdentifierFeatureName': 'neighborhood-political',
 'EventTimeFeatureName': 'event_time',
 'FeatureDefinitions': [{'FeatureName': 'neighborhood-political',
   'FeatureType': 'String'},
  {'FeatureName': 'less_1_ocean', 'FeatureType': 'Integral'},
  {'FeatureName': 'inland', 'FeatureType': 'Integral'},
  {'FeatureName': 'island', 'FeatureType': 'Integral'},
  {'FeatureName': 'near_bay', 'FeatureType': 'Integral'},
  {'FeatureName': 'near_ocean', 'FeatureType': 'Integral'},
  {'FeatureName': 'median_house_value', 'FeatureType': 'Fractional'},
  {'FeatureName': 'median_house_age_group', 'FeatureType': 'Fractional'},
  {'FeatureName': 'total_households', 'FeatureType': 'Integral'},
  {'FeatureName': 'bedrooms_per_household', 'FeatureType': 'Fractional'},
  {'FeatureName': 'locality_code',

In [33]:
sagemaker_client.list_feature_groups()  # use boto client to list FeatureGroups


{'FeatureGroupSummaries': [{'FeatureGroupName': 'neighborhood_politics_feature_group-28-22-08-36',
   'FeatureGroupArn': 'arn:aws:sagemaker:us-east-1:904981812149:feature-group/neighborhood_politics_feature_group-28-22-08-36',
   'CreationTime': datetime.datetime(2024, 5, 28, 22, 8, 37, 127000, tzinfo=tzlocal()),
   'FeatureGroupStatus': 'Created'},
  {'FeatureGroupName': 'neighborhood_politics_feature_group-28-21-56-47',
   'FeatureGroupArn': 'arn:aws:sagemaker:us-east-1:904981812149:feature-group/neighborhood_politics_feature_group-28-21-56-47',
   'CreationTime': datetime.datetime(2024, 5, 28, 21, 57, 4, 661000, tzinfo=tzlocal()),
   'FeatureGroupStatus': 'Created',
   'OfflineStoreStatus': {'Status': 'Active'}}],
 'ResponseMetadata': {'RequestId': '0c90a5ee-2dbc-495e-bbe5-338b46d3f2bd',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '0c90a5ee-2dbc-495e-bbe5-338b46d3f2bd',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '582',
   'date': 'Tue, 2

### PutRecords into FeatureGroup


After the FeatureGroup has been created, we can put data into it by using the PutRecord API. This API can handle high TPS and is designed to be called by different streams. The data from all of these Put requests is buffered and written to S3 in chunks. The files will be written to the offline store within a few minutes of ingestion. In this example, the SDK's `ingest` method calls PutRecord for each row of the DataFrame, and we specify multiple workers (`max_workers=3`) so that rows are ingested in parallel. Ingestion into the FeatureGroup should take about a minute.
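For a single row, the payload sent to PutRecord is a list of name/value pairs, with every value passed as a string. The sketch below builds that list from a plain dictionary. `row_to_record` is a hypothetical helper, the `event_time` value is made up, and skipping `None` values is an assumption of this sketch.

```python
def row_to_record(row: dict) -> list:
    """Convert a column -> value mapping into the list-of-features shape used by PutRecord."""
    return [
        {"FeatureName": name, "ValueAsString": str(value)}
        for name, value in row.items()
        if value is not None  # assumption: missing values are simply omitted
    ]


row = {
    "neighborhood-political": "Brooktree",
    "less_1_ocean": 1,
    "median_house_value": 257400.0,
    "event_time": 1716934116.0,  # made-up Unix timestamp in seconds
}
record = row_to_record(row)
print(record)

# With a real runtime client, the call would look like this (not executed here):
# featurestore_runtime.put_record(
#     FeatureGroupName=neighborhood_politics_feature_group_name,
#     Record=record,
# )
```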

In [34]:
neighborhood_politics_feature_group.ingest(data_frame=neighborhood_features_cleaned, max_workers=3, wait=True)


IngestionManagerPandas(feature_group_name='neighborhood_politics_feature_group-28-22-08-36', feature_definitions={'neighborhood-political': {'FeatureName': 'neighborhood-political', 'FeatureType': 'String'}, 'less_1_ocean': {'FeatureName': 'less_1_ocean', 'FeatureType': 'Integral'}, 'inland': {'FeatureName': 'inland', 'FeatureType': 'Integral'}, 'island': {'FeatureName': 'island', 'FeatureType': 'Integral'}, 'near_bay': {'FeatureName': 'near_bay', 'FeatureType': 'Integral'}, 'near_ocean': {'FeatureName': 'near_ocean', 'FeatureType': 'Integral'}, 'median_house_value': {'FeatureName': 'median_house_value', 'FeatureType': 'Fractional'}, 'median_house_age_group': {'FeatureName': 'median_house_age_group', 'FeatureType': 'Fractional'}, 'total_households': {'FeatureName': 'total_households', 'FeatureType': 'Integral'}, 'bedrooms_per_household': {'FeatureName': 'bedrooms_per_household', 'FeatureType': 'Fractional'}, 'locality_code': {'FeatureName': 'locality_code', 'FeatureType': 'Integral'}, 

In [35]:
### Grabbing the Record from the online store
record_identifier_value = 'Brooktree'

featurestore_runtime.get_record(
    FeatureGroupName=neighborhood_politics_feature_group_name,
    RecordIdentifierValueAsString=record_identifier_value,
)

{'ResponseMetadata': {'RequestId': 'ea31dc2a-a2db-4aac-aa97-e668bc8ca515',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'ea31dc2a-a2db-4aac-aa97-e668bc8ca515',
   'content-type': 'application/json',
   'content-length': '1012',
   'date': 'Tue, 28 May 2024 22:09:39 GMT'},
  'RetryAttempts': 0},
 'Record': [{'FeatureName': 'neighborhood-political',
   'ValueAsString': 'Brooktree'},
  {'FeatureName': 'less_1_ocean', 'ValueAsString': '1'},
  {'FeatureName': 'inland', 'ValueAsString': '0'},
  {'FeatureName': 'island', 'ValueAsString': '0'},
  {'FeatureName': 'near_bay', 'ValueAsString': '0'},
  {'FeatureName': 'near_ocean', 'ValueAsString': '0'},
  {'FeatureName': 'median_house_value', 'ValueAsString': '257400.0'},
  {'FeatureName': 'median_house_age_group', 'ValueAsString': '0.0'},
  {'FeatureName': 'total_households', 'ValueAsString': '1438'},
  {'FeatureName': 'bedrooms_per_household',
   'ValueAsString': '0.3740407180372474'},
  {'FeatureName': 'locality_code', 'Value

In [36]:
### Grabbing the Record from the online store
record_identifier_value = "Fisherman's Wharf"

featurestore_runtime.get_record(
    FeatureGroupName=neighborhood_politics_feature_group_name,
    RecordIdentifierValueAsString=record_identifier_value,
)

{'ResponseMetadata': {'RequestId': '767f6797-d729-4f6e-91f4-a407ea6498f4',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '767f6797-d729-4f6e-91f4-a407ea6498f4',
   'content-type': 'application/json',
   'content-length': '1007',
   'date': 'Tue, 28 May 2024 22:09:39 GMT'},
  'RetryAttempts': 0},
 'Record': [{'FeatureName': 'neighborhood-political',
   'ValueAsString': "Fisherman's Wharf"},
  {'FeatureName': 'less_1_ocean', 'ValueAsString': '0'},
  {'FeatureName': 'inland', 'ValueAsString': '0'},
  {'FeatureName': 'island', 'ValueAsString': '0'},
  {'FeatureName': 'near_bay', 'ValueAsString': '1'},
  {'FeatureName': 'near_ocean', 'ValueAsString': '0'},
  {'FeatureName': 'median_house_value', 'ValueAsString': '500000.0'},
  {'FeatureName': 'median_house_age_group', 'ValueAsString': '50.0'},
  {'FeatureName': 'total_households', 'ValueAsString': '250'},
  {'FeatureName': 'bedrooms_per_household', 'ValueAsString': '1.268'},
  {'FeatureName': 'locality_code', 'ValueAsString

In [37]:
### Grabbing the Record from the online store
record_identifier_value = 'Los Osos'

featurestore_runtime.get_record(
    FeatureGroupName=neighborhood_politics_feature_group_name,
    RecordIdentifierValueAsString=record_identifier_value,
)

{'ResponseMetadata': {'RequestId': 'ac672514-1209-4b00-9b47-a0cc25ee19d8',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'ac672514-1209-4b00-9b47-a0cc25ee19d8',
   'content-type': 'application/json',
   'content-length': '1010',
   'date': 'Tue, 28 May 2024 22:09:39 GMT'},
  'RetryAttempts': 0},
 'Record': [{'FeatureName': 'neighborhood-political',
   'ValueAsString': 'Los Osos'},
  {'FeatureName': 'less_1_ocean', 'ValueAsString': '0'},
  {'FeatureName': 'inland', 'ValueAsString': '0'},
  {'FeatureName': 'island', 'ValueAsString': '0'},
  {'FeatureName': 'near_bay', 'ValueAsString': '0'},
  {'FeatureName': 'near_ocean', 'ValueAsString': '1'},
  {'FeatureName': 'median_house_value', 'ValueAsString': '221612.5'},
  {'FeatureName': 'median_house_age_group', 'ValueAsString': '10.0'},
  {'FeatureName': 'total_households', 'ValueAsString': '612'},
  {'FeatureName': 'bedrooms_per_household',
   'ValueAsString': '1.0956112852664577'},
  {'FeatureName': 'locality_code', 'ValueA
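GetRecord returns every feature as a string, so a model that expects numbers needs a conversion step. Here is a minimal sketch of that step. `record_to_dict` and the `casts` mapping are hypothetical, and `response` is a shortened copy of the response shape shown above rather than a live API result.

```python
# Shortened copy of a GetRecord response; values are taken from the output above.
response = {
    "Record": [
        {"FeatureName": "neighborhood-political", "ValueAsString": "Los Osos"},
        {"FeatureName": "near_ocean", "ValueAsString": "1"},
        {"FeatureName": "median_house_value", "ValueAsString": "221612.5"},
        {"FeatureName": "total_households", "ValueAsString": "612"},
    ]
}

# Which Python type each feature should be converted to; unlisted features stay strings.
casts = {"near_ocean": int, "median_house_value": float, "total_households": int}


def record_to_dict(response: dict, casts: dict) -> dict:
    """Flatten the 'Record' list into {feature_name: typed_value}."""
    features = {}
    for item in response.get("Record", []):
        name = item["FeatureName"]
        features[name] = casts.get(name, str)(item["ValueAsString"])
    return features


features = record_to_dict(response, casts)
print(features)
```

A response without a `Record` key yields an empty dictionary in this sketch, which keeps the lookup of a missing identifier from raising a `KeyError`.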