# Preprocess Genomic Data

This notebook reads genomic data (in tabular format) from S3 and stores the resulting features in SageMaker FeatureStore.

## Step 1: Read in the SageMaker JumpStart Solution configuration

In [None]:
import json

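# Use a context manager so the file handle is closed after reading
with open("stack_outputs.json") as config_file:
    SOLUTION_CONFIG = json.load(config_file)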
SOLUTION_BUCKET = SOLUTION_CONFIG["SolutionS3Bucket"]
REGION = SOLUTION_CONFIG["AWSRegion"]
SOLUTION_PREFIX = SOLUTION_CONFIG["SolutionPrefix"]
SOLUTION_NAME = SOLUTION_CONFIG["SolutionName"]
BUCKET = SOLUTION_CONFIG["S3Bucket"]
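
Because every later step depends on these values, it can help to check that the configuration contains all the expected keys before proceeding. The following is a minimal sketch; `find_missing_keys` and `example_config` are hypothetical names used only for illustration, and the key list simply mirrors the keys read in the cell above.

```python
# Hypothetical helper: the key names are the ones read from stack_outputs.json above.
REQUIRED_KEYS = (
    "SolutionS3Bucket",
    "AWSRegion",
    "SolutionPrefix",
    "SolutionName",
    "S3Bucket",
)


def find_missing_keys(config, required=REQUIRED_KEYS):
    """Return the required keys that are absent from the config dict."""
    return [key for key in required if key not in config]


# Made-up, deliberately incomplete configuration for demonstration.
example_config = {"SolutionS3Bucket": "example-bucket", "AWSRegion": "us-west-2"}
missing = find_missing_keys(example_config)
print(missing)
```

In the notebook, the same check could be run as `find_missing_keys(SOLUTION_CONFIG)`; an empty list means all keys are present.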

## Step 2: Download and read in the genomic dataset

The dataset file `GSE103584_R01_NSCLC_RNAseq.txt` used here was pre-processed with open-source tools and made available on TCIA: STAR v.2.3 was used for alignment and Cufflinks v.2.0.2 for expression calls. Further details can be found in [1]. The original dataset can also be downloaded from https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE103584

[1] Zhou, Mu, et al. "Non–small cell lung cancer radiogenomics map identifies relationships between molecular and imaging phenotypes with prognostic implications." Radiology 286.1 (2018): 307-315.

#### Download the input data from S3

In [None]:
import pandas as pd

file_name = "GSE103584_R01_NSCLC_RNAseq.txt"

input_data_bucket = f"s3://{SOLUTION_BUCKET}-{REGION}/{SOLUTION_NAME}/data"
input_data = f"{input_data_bucket}/{file_name}"
!aws s3 cp $input_data .

We read the tab-delimited text file and build a subset by dropping the case IDs and genes that are not relevant.

In [None]:
import pandas as pd

gen_data = pd.read_csv(file_name, delimiter = '\t')

# Remove case IDs that lack weight and pack-years values in the clinical data
drop_cases = ['R01-003', 'R01-004', 'R01-006', 'R01-007', 'R01-015', 'R01-016', 'R01-018', 'R01-022', 'R01-023', 'R01-098', 'R01-105']

gen_data = gen_data.drop(drop_cases, axis = 1)

# Give the gene-name column (read in as 'Unnamed: 0') an explicit name
gen_data.rename(columns={'Unnamed: 0':'index_temp'}, inplace=True)

# Transpose the dataframe such that rows = case IDs and cols = genes 
gen_data.set_index('index_temp', inplace=True)
gen_data_t = gen_data.transpose()
gen_data_t.reset_index(inplace=True)
gen_data_t.rename(columns={'index':'Case_ID'}, inplace=True)
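
To make the reshaping step concrete, here is a small sketch of the same rename, transpose, and reset sequence on a made-up two-gene, two-case table. The gene names `GENE_A` and `GENE_B` and all values are invented for illustration; only the column label `Unnamed: 0` and the `Case_ID` naming follow the cell above.

```python
import pandas as pd

# Made-up miniature of the expression table: rows are genes, columns are case IDs.
toy = pd.DataFrame(
    {
        "Unnamed: 0": ["GENE_A", "GENE_B"],
        "R01-001": [1.5, None],
        "R01-002": [0.2, 3.1],
    }
)

# Same sequence as above: name the gene column, index by it, then transpose.
toy = toy.rename(columns={"Unnamed: 0": "index_temp"}).set_index("index_temp")
toy_t = toy.transpose().reset_index().rename(columns={"index": "Case_ID"})

print(toy_t.columns.tolist())
print(toy_t["Case_ID"].tolist())
```

After the transpose, each row is one case and each gene is a column, which is the layout needed for one record per `Case_ID`.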


We keep only the genes suggested by Zhou et al. [1]: those corresponding to Metagenes 19, 10, 9, 4, 3, and 21 in Table 2 of the paper (https://pubs.rsna.org/doi/pdf/10.1148/radiol.2017161845).

In [None]:
# Select the columns corresponding to the subset of genes, plus Case_ID

selected_columns = ['Case_ID','LRIG1', 'HPGD', 'GDF15', 'CDH2', 'POSTN', 'VCAN', 'PDGFRA', 'VCAM1', 'CD44', 'CD48', 'CD4', 'LYL1', 'SPI1', 'CD37', 'VIM', 'LMO2', 'EGR2', 'BGN', 'COL4A1', 'COL5A1', 'COL5A2']

gen_data_t = gen_data_t[selected_columns]

# Replace NaN with 0
data_gen = gen_data_t.fillna(0)
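
Selecting columns by a list of labels raises a `KeyError` if any label is absent, so it can be useful to check the gene list against the available columns first. The sketch below uses plain lists; `split_columns` is a hypothetical helper, and `NOT_A_GENE` is a made-up name included to show the missing case.

```python
def split_columns(requested, available):
    """Split requested column names into those present and those missing."""
    available_set = set(available)
    present = [name for name in requested if name in available_set]
    missing = [name for name in requested if name not in available_set]
    return present, missing


requested = ["Case_ID", "LRIG1", "HPGD", "NOT_A_GENE"]  # last name is made up
available = ["Case_ID", "HPGD", "LRIG1", "TP53"]

present, missing = split_columns(requested, available)
print(present)
print(missing)
```

In the notebook, the equivalent call would be `split_columns(selected_columns, gen_data_t.columns)`, run before the selection step.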


## Step 3: Create SageMaker FeatureStore

First, we cast object-dtype columns to the pandas string dtype, which the SageMaker FeatureStore Python SDK maps to the String feature type. The existing `Case_ID` column serves as the record identifier (`record_identifier_feature_name`), and we append an `EventTime` column (`event_time_feature_name`); a feature group requires both.

In [None]:
import time
import pandas as pd

def cast_object_to_string(data_frame):
    for label in data_frame.columns:
        if data_frame.dtypes[label] == "object":
            data_frame[label] = data_frame[label].astype("str").astype("string")

# Cast object dtype to string. SageMaker FeatureStore Python SDK will then map the string dtype to String feature type.
cast_object_to_string(data_gen)

# Record identifier and event time feature names
record_identifier_feature_name = "Case_ID"
event_time_feature_name = "EventTime"

current_time_sec = int(round(time.time()))

# Append EventTime feature
data_gen[event_time_feature_name] = pd.Series([current_time_sec]*len(data_gen), dtype="float64")
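
As a quick illustration of what the `EventTime` column holds, the sketch below computes the same value (whole seconds since the Unix epoch, stored as a float) and converts it back to a readable UTC timestamp. The variable names `event_time` and `as_utc` are introduced here only for illustration.

```python
import time
from datetime import datetime, timezone

# Whole seconds since the Unix epoch, as computed in the cell above.
current_time_sec = int(round(time.time()))

# The column stores this value as float64, identical for every row.
event_time = float(current_time_sec)

# Convert back to a timezone-aware datetime to sanity-check the value.
as_utc = datetime.fromtimestamp(event_time, tz=timezone.utc)
print(event_time, as_utc.isoformat())
```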


Next, we define the FeatureGroup and load the feature definitions into it.

In [None]:
import boto3
import sagemaker
from sagemaker.session import Session
from sagemaker.feature_store.feature_group import FeatureGroup

boto_session = boto3.Session(region_name=REGION)
sagemaker_client = boto_session.client(service_name="sagemaker", region_name=REGION)
featurestore_runtime = boto_session.client(service_name="sagemaker-featurestore-runtime", region_name=REGION)

feature_store_session = Session(
    boto_session=boto_session,
    sagemaker_client=sagemaker_client,
    sagemaker_featurestore_runtime_client=featurestore_runtime
)

genomic_feature_group_name = f"{SOLUTION_PREFIX}-genomic-feature-group"
%store genomic_feature_group_name

genomic_feature_group = FeatureGroup(name=genomic_feature_group_name, sagemaker_session=feature_store_session)

# Load feature definitions to the feature group. SageMaker FeatureStore Python SDK will auto-detect the data schema based on input data.
genomic_feature_group.load_feature_definitions(data_frame=data_gen) # output is suppressed

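We create the FeatureGroup for the genomic dataset with both the online and offline stores enabled: `s3_uri` configures the offline store, and `enable_online_store=True` turns on the online store.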

In [None]:
%%time

def wait_for_feature_group_creation_complete(feature_group):
    status = feature_group.describe().get("FeatureGroupStatus")
    while status == "Creating":
        print("Waiting for Feature Group Creation")
        time.sleep(5)
        status = feature_group.describe().get("FeatureGroupStatus")
    if status != "Created":
        raise RuntimeError(f"Failed to create feature group {feature_group.name}")
    print(f"FeatureGroup {feature_group.name} successfully created.")


prefix = "genomic"

genomic_feature_group.create(
    s3_uri=f"s3://{BUCKET}/{prefix}",
    record_identifier_name=record_identifier_feature_name,
    event_time_feature_name=event_time_feature_name,
    role_arn=sagemaker.get_execution_role(),
    enable_online_store=True
)

wait_for_feature_group_creation_complete(feature_group=genomic_feature_group)
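
The polling loop above waits indefinitely while the status is `Creating`. As one possible variation, the sketch below adds a timeout. `wait_for_status` and its parameters are hypothetical names, and the `describe` argument is a stand-in for `feature_group.describe` so that the example runs without AWS access.

```python
import time


def wait_for_status(describe, pending="Creating", done="Created",
                    timeout_sec=600, poll_sec=5,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll describe() until the status leaves `pending` or the timeout expires."""
    deadline = clock() + timeout_sec
    status = describe().get("FeatureGroupStatus")
    while status == pending:
        if clock() >= deadline:
            raise TimeoutError(f"Status still {pending!r} after {timeout_sec} seconds")
        sleep(poll_sec)
        status = describe().get("FeatureGroupStatus")
    if status != done:
        raise RuntimeError(f"Unexpected final status: {status!r}")
    return status


# Stand-in for feature_group.describe: two pending responses, then success.
responses = iter(
    [
        {"FeatureGroupStatus": "Creating"},
        {"FeatureGroupStatus": "Creating"},
        {"FeatureGroupStatus": "Created"},
    ]
)
final_status = wait_for_status(lambda: next(responses), sleep=lambda _: None)
print(final_status)
```

In the notebook, this could be called as `wait_for_status(genomic_feature_group.describe)`. Injecting `sleep` and `clock` is a design choice that keeps the helper testable without real waiting.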

After the feature group is created, we can ingest the genomic dataset into it.

In [None]:
genomic_feature_group.ingest(
    data_frame=data_gen, max_workers=3, wait=True
)
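
Once ingestion finishes, one way to spot-check the online store is to fetch a single record and flatten it into a dictionary. The sketch below shows only the flattening step on a made-up response, so it runs without AWS access. `record_to_dict` is a hypothetical helper, and the values in `fake_response` are invented for illustration.

```python
def record_to_dict(response):
    """Flatten a GetRecord-style response into {feature name: value as string}."""
    # A response without a "Record" key (no matching record) yields an empty dict.
    return {
        item["FeatureName"]: item["ValueAsString"]
        for item in response.get("Record", [])
    }


# Made-up response in the shape returned by GetRecord; values are illustrative only.
fake_response = {
    "Record": [
        {"FeatureName": "Case_ID", "ValueAsString": "R01-001"},
        {"FeatureName": "LRIG1", "ValueAsString": "12.5"},
    ]
}

features = record_to_dict(fake_response)
print(features)

# In the notebook, assuming the online store is enabled, the response could come from:
# featurestore_runtime.get_record(
#     FeatureGroupName=genomic_feature_group_name,
#     RecordIdentifierValueAsString="<a Case_ID present in data_gen>",
# )
```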

## Next Stage

Next, we'll take a look at preparing the clinical data.

Click here to [continue](./2_preprocess_clinical_data.ipynb).