# Preprocess Genomic Data

This notebook reads genomic data (in tabular format) from S3 and stores the resulting features in SageMaker FeatureStore.

## Step 1: Read in the SageMaker JumpStart Solution configuration
#### Set up configuration parameters from the provided JSON file.

In [1]:
import json

# Use a context manager so the file handle is closed after reading
with open('stack_outputs.json') as f:
    SOLUTION_CONFIG = json.load(f)
SOLUTION_BUCKET = SOLUTION_CONFIG['SolutionS3Bucket']
REGION = SOLUTION_CONFIG['AWSRegion']
SOLUTION_PREFIX = SOLUTION_CONFIG['SolutionPrefix']
SOLUTION_NAME = SOLUTION_CONFIG['SolutionName']
BUCKET = SOLUTION_CONFIG['S3Bucket']
ROLE = SOLUTION_CONFIG['IamRole']
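If a key is missing from `stack_outputs.json`, the cell above fails with a bare `KeyError` partway through. The following is a minimal sketch of an early check, assuming the six keys read above are the only ones this notebook needs. The helper name `load_solution_config` and the placeholder values are hypothetical and used only for illustration.

```python
import json
import os
import tempfile

# Keys this notebook reads from stack_outputs.json (taken from the cell above).
REQUIRED_KEYS = (
    "SolutionS3Bucket", "AWSRegion", "SolutionPrefix",
    "SolutionName", "S3Bucket", "IamRole",
)

def load_solution_config(path):
    """Hypothetical helper: load the JSON file and fail early on missing keys."""
    with open(path) as f:
        config = json.load(f)
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError(f"Missing keys in {path}: {missing}")
    return config

# Demonstration against a temporary file filled with placeholder values.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "stack_outputs.json")
    with open(demo_path, "w") as f:
        json.dump({key: f"placeholder-{key}" for key in REQUIRED_KEYS}, f)
    demo_config = load_solution_config(demo_path)

print(sorted(demo_config))
```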

## Step 2: Download and read in the genomic dataset
#### Read the tab-delimited text file containing the data, then create a subset by removing case IDs and genes that are not relevant.

In [2]:
import pandas as pd

gen_data = pd.read_csv('GSE103584_R01_NSCLC_RNAseq.txt', delimiter = '\t')

# Remove case IDs that do not have weight and pack-years in the clinical data
l_caseID_drop = ['R01-003', 'R01-004', 'R01-006', 'R01-007', 'R01-015', 'R01-016', 'R01-018', 'R01-022', 'R01-023', 'R01-098', 'R01-105']

gen_data1 = gen_data.drop(l_caseID_drop, axis = 1)

# Add column name for genes 
gen_data1.rename(columns={'Unnamed: 0':'index_temp'}, inplace=True)

# Transpose the dataframe such that rows = case IDs and cols = genes 
gen_data1.set_index('index_temp',inplace=True)
gen_data_t = gen_data1.transpose()
gen_data_t.reset_index(inplace=True)
gen_data_t.rename(columns={'index':'Case_ID'}, inplace=True)
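The reshaping in the cell above can be easier to follow on a tiny table. The sketch below applies the same sequence of steps (drop columns, name the gene column, transpose, restore the case ID as a column) to made-up data. The gene names, case IDs, and values are invented for illustration and do not come from the dataset.

```python
import pandas as pd

# Toy input in the same layout as the raw file:
# one row per gene, one column per case, genes in an unnamed first column.
toy = pd.DataFrame({
    "Unnamed: 0": ["GENE_A", "GENE_B"],
    "CASE-1": [1.0, 2.0],
    "CASE-2": [3.0, None],
    "CASE-3": [5.0, 6.0],
})

toy_t = (
    toy.drop(columns=["CASE-3"])                    # drop an unwanted case
       .rename(columns={"Unnamed: 0": "index_temp"})
       .set_index("index_temp")                     # genes become the row index
       .transpose()                                 # rows = cases, columns = genes
       .reset_index()                               # case IDs move to a column named 'index'
       .rename(columns={"index": "Case_ID"})
)

print(toy_t)
```

After the transpose, each row holds one case and each gene is a separate column, which is the layout a feature group with one record per `Case_ID` expects.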



In [3]:
# Keep the genes suggested in Zhou, Mu, et al. [1]
# These are genes corresponding to Metagenes 19, 10, 9, 4, 3, 21 in Table 2 of the paper

l_genes = ['Case_ID','LRIG1', 'HPGD', 'GDF15', 'CDH2', 'POSTN', 'VCAN', 'PDGFRA', 'VCAM1', 'CD44', 'CD48', 'CD4', 'LYL1', 'SPI1', 'CD37', 'VIM', 'LMO2', 'EGR2', 'BGN', 'COL4A1', 'COL5A1', 'COL5A2']

gen_data2 = gen_data_t[l_genes]

# Replace NaN with 0
data_gen = gen_data2.fillna(0)
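Selecting columns by name raises a `KeyError` if a gene is absent, and `fillna(0)` hides how many values were missing. A minimal sketch of checking both before filling, using made-up data (the column names and values below are illustrative only):

```python
import pandas as pd

# Toy frame standing in for the transposed expression table.
toy = pd.DataFrame({
    "Case_ID": ["CASE-1", "CASE-2"],
    "GENE_A": [1.0, None],
    "GENE_B": [None, None],
})

wanted = ["Case_ID", "GENE_A", "GENE_C"]  # GENE_C is deliberately absent

# Split the requested columns into those present and those missing.
missing = [col for col in wanted if col not in toy.columns]
present = [col for col in wanted if col in toy.columns]

# Count NaN values per column before they are replaced with 0.
nan_counts = toy[present].isna().sum()
filled = toy[present].fillna(0)

print("Missing columns:", missing)
print(nan_counts)
```

Whether 0 is an appropriate fill value depends on how the expression values are scaled; the notebook uses 0, and this sketch only makes the number of filled values visible.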


## Step 3: Create SageMaker FeatureStore

First, we cast `object` dtype columns to `string`, which maps to the String feature type in SageMaker FeatureStore. We then name the record identifier (`record_identifier_feature_name`, the existing `Case_ID` column) and add an event time column (`event_time_feature_name`), both of which are needed to create the feature group.

In [4]:
import time

# By default, pandas uses the `object` dtype for string values for backward compatibility.
# See https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html.
# This function converts string features from `object` to `string`, modifying the DataFrame in place.
def cast_object_to_string(data_frame):
    for label in data_frame.columns:
        if data_frame.dtypes[label] == 'object':
            data_frame[label] = data_frame[label].astype('string')

# Cast object dtype to string. SageMaker FeatureStore Python SDK will then map the string dtype to String feature type.
cast_object_to_string(data_gen)

# Record identifier and event time feature names
record_identifier_feature_name = 'Case_ID'
event_time_feature_name = 'EventTime'

current_time_sec = float(round(time.time()))

# Append EventTime feature
data_gen[event_time_feature_name] = current_time_sec
# pd.Series([current_time_sec]*len(data_gen), dtype='float64')
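The cell above stores the event time as a float holding whole Unix epoch seconds, so it is registered as a Fractional feature. To my understanding, Feature Store also accepts event times as ISO-8601 strings in UTC; treat that as an assumption to verify against the service documentation. The sketch below shows both representations of the same instant using only the standard library.

```python
import time
from datetime import datetime, timezone

# Same expression as in the notebook: whole seconds since the Unix epoch, as a float.
current_time_sec = float(round(time.time()))

# Equivalent ISO-8601 UTC string for the same instant.
iso_format = "%Y-%m-%dT%H:%M:%SZ"
event_time_iso = datetime.fromtimestamp(current_time_sec, tz=timezone.utc).strftime(iso_format)

# Round trip back to epoch seconds to confirm both forms describe the same instant.
round_trip = datetime.strptime(event_time_iso, iso_format).replace(tzinfo=timezone.utc).timestamp()

print(current_time_sec, event_time_iso)
```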

Next, we define the FeatureGroup and load the feature definitions into it.

In [5]:
import boto3
import sagemaker
from sagemaker.session import Session
from sagemaker.feature_store.feature_group import FeatureGroup

boto_session = boto3.Session(region_name=REGION)
sagemaker_client = boto_session.client(service_name='sagemaker', region_name=REGION)
featurestore_runtime = boto_session.client(service_name='sagemaker-featurestore-runtime', region_name=REGION)

feature_store_session = Session(
    boto_session=boto_session,
    sagemaker_client=sagemaker_client,
    sagemaker_featurestore_runtime_client=featurestore_runtime
)

genomic_feature_group_name = f'{SOLUTION_PREFIX}-genomic-feature-group'
%store genomic_feature_group_name

genomic_feature_group = FeatureGroup(name=genomic_feature_group_name, sagemaker_session=feature_store_session)

# Load feature definitions to the feature group. SageMaker FeatureStore Python SDK will auto-detect the data schema based on input data.
genomic_feature_group.load_feature_definitions(data_frame=data_gen) # output is suppressed

Stored 'genomic_feature_group_name' (str)


[FeatureDefinition(feature_name='Case_ID', feature_type=<FeatureTypeEnum.STRING: 'String'>),
 FeatureDefinition(feature_name='LRIG1', feature_type=<FeatureTypeEnum.FRACTIONAL: 'Fractional'>),
 FeatureDefinition(feature_name='HPGD', feature_type=<FeatureTypeEnum.FRACTIONAL: 'Fractional'>),
 FeatureDefinition(feature_name='GDF15', feature_type=<FeatureTypeEnum.FRACTIONAL: 'Fractional'>),
 FeatureDefinition(feature_name='CDH2', feature_type=<FeatureTypeEnum.FRACTIONAL: 'Fractional'>),
 FeatureDefinition(feature_name='POSTN', feature_type=<FeatureTypeEnum.FRACTIONAL: 'Fractional'>),
 FeatureDefinition(feature_name='VCAN', feature_type=<FeatureTypeEnum.FRACTIONAL: 'Fractional'>),
 FeatureDefinition(feature_name='PDGFRA', feature_type=<FeatureTypeEnum.FRACTIONAL: 'Fractional'>),
 FeatureDefinition(feature_name='VCAM1', feature_type=<FeatureTypeEnum.FRACTIONAL: 'Fractional'>),
 FeatureDefinition(feature_name='CD44', feature_type=<FeatureTypeEnum.FRACTIONAL: 'Fractional'>),
 FeatureDefinition(
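The output above shows `Case_ID` mapped to String and the float-valued gene columns mapped to Fractional. The function below is a simplified illustration of that kind of dtype-to-feature-type mapping. It is not the SDK's code, the function name `feature_type_for` is hypothetical, and the handling of integer dtypes as Integral is an assumption based on the three feature types Feature Store offers.

```python
def feature_type_for(dtype_name):
    """Illustrative mapping from a pandas dtype name to a feature type name."""
    name = dtype_name.lower()
    if name.startswith(("int", "uint")):
        return "Integral"      # assumption: integer dtypes map to Integral
    if name.startswith("float"):
        return "Fractional"    # matches the gene columns in the output above
    if name in ("string", "object"):
        return "String"        # matches Case_ID in the output above
    raise ValueError(f"No feature type for dtype {dtype_name!r}")

# Example dtypes as they might appear in a frame like data_gen.
example_dtypes = {"Case_ID": "string", "LRIG1": "float64", "EventTime": "float64"}
example_types = {col: feature_type_for(dtype) for col, dtype in example_dtypes.items()}
print(example_types)
```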

We create the FeatureGroup for the genomic dataset with both the online and offline stores enabled.

In [6]:
%%time

def wait_for_feature_group_creation_complete(feature_group):
    status = feature_group.describe().get('FeatureGroupStatus')
    while status == 'Creating':
        print('Waiting for Feature Group Creation')
        time.sleep(5)
        status = feature_group.describe().get('FeatureGroupStatus')
    if status != 'Created':
        raise RuntimeError(f'Failed to create feature group {feature_group.name}')
    print(f'FeatureGroup {feature_group.name} successfully created.')


prefix = 'sagemaker-featurestore-demo'

genomic_feature_group.create(
    s3_uri=f's3://{BUCKET}/{prefix}',
    record_identifier_name=record_identifier_feature_name,
    event_time_feature_name=event_time_feature_name,
    role_arn=ROLE,
    enable_online_store=True
)

wait_for_feature_group_creation_complete(feature_group=genomic_feature_group)

Waiting for Feature Group Creation
Waiting for Feature Group Creation
Waiting for Feature Group Creation
Waiting for Feature Group Creation
FeatureGroup sagemaker-soln-lcsp-js-abc-genomic-feature-group successfully created.
CPU times: user 39.4 ms, sys: 0 ns, total: 39.4 ms
Wall time: 20.6 s
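The wait loop above polls until the status leaves `Creating`, with no upper bound on how long it waits. A variant with a timeout could look like the sketch below. To keep it runnable without AWS access, it polls a stand-in object. `FakeFeatureGroup`, `wait_for_status`, and the `CreateFailed` status used in the demo are assumptions for illustration, and only the `describe()` call and the `FeatureGroupStatus` key are taken from the notebook.

```python
import time

class FakeFeatureGroup:
    """Stand-in for illustration only; returns a scripted sequence of statuses."""
    def __init__(self, name, statuses):
        self.name = name
        self._statuses = list(statuses)

    def describe(self):
        # Return the next scripted status, repeating the last one once exhausted.
        status = self._statuses.pop(0) if len(self._statuses) > 1 else self._statuses[0]
        return {"FeatureGroupStatus": status}

def wait_for_status(feature_group, timeout_sec=600, poll_sec=5):
    """Poll until the group is Created, fail on other statuses, give up after timeout_sec."""
    deadline = time.monotonic() + timeout_sec
    while True:
        status = feature_group.describe().get("FeatureGroupStatus")
        if status == "Created":
            return status
        if status != "Creating":
            raise RuntimeError(f"Failed to create feature group {feature_group.name}: {status}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Timed out waiting for feature group {feature_group.name}")
        time.sleep(poll_sec)

demo_group = FakeFeatureGroup("demo-group", ["Creating", "Creating", "Created"])
result = wait_for_status(demo_group, timeout_sec=10, poll_sec=0)
print(result)
```

The same function should work with the real `genomic_feature_group` object, since it relies only on `describe()` and `name`, both of which the notebook already uses.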


In [7]:
f's3://{BUCKET}/{prefix}'

's3://js-solution-2p/sagemaker-featurestore-demo'

After the feature group is created, we can ingest the genomic dataset into it.

In [8]:
genomic_feature_group.ingest(
    data_frame=data_gen, max_workers=3, wait=True
)

IngestionManagerPandas(feature_group_name='sagemaker-soln-lcsp-js-abc-genomic-feature-group', sagemaker_fs_runtime_client_config=<botocore.config.Config object at 0x7fe7f4a90950>, max_workers=3, max_processes=1, _async_result=<multiprocess.pool.MapResult object at 0x7fe7f4a65650>, _processing_pool=<pool ProcessPool(ncpus=1)>, _failed_indices=[])
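The `_failed_indices=[]` field in the output above suggests every row was ingested. One way to spot-check a record is to read it back from the online store with the runtime client's `get_record` call and flatten the response. The sketch below shows the flattening step on a hand-written sample response so it runs without AWS access. The helper name `record_to_dict`, the sample values, and the `<case-id>` placeholder are illustrative, not real data.

```python
def record_to_dict(response):
    """Flatten a get_record response into {feature_name: value_as_string}."""
    return {
        item["FeatureName"]: item["ValueAsString"]
        for item in response.get("Record", [])
    }

# With AWS access, the response would come from the runtime client created earlier:
# response = featurestore_runtime.get_record(
#     FeatureGroupName=genomic_feature_group_name,
#     RecordIdentifierValueAsString="<case-id>",
# )

# Hand-written sample in the same shape, with made-up values.
sample_response = {
    "Record": [
        {"FeatureName": "Case_ID", "ValueAsString": "CASE-1"},
        {"FeatureName": "GENE_A", "ValueAsString": "1.5"},
    ]
}

record = record_to_dict(sample_response)
print(record)
```

Values come back as strings regardless of feature type, so numeric features need converting (for example with `float`) before use.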

In [9]:
# Option to delete feature group later when done to free resources.
# genomic_feature_group.delete()

## Next Stage

Next, we'll take a look at preparing the clinical data.

Click here to [continue](./2_preprocess_clinical_data.ipynb).