# Preprocess Clinical Data

The NSCLC Radiogenomic clinical data is in structured tabular form, as is common for EHR data extracts and health insurance claims data. It consists of demographic (gender, ethnicity) and health-behavior (smoking history) information, cancer recurrence status, histology, histopathological grading, pathological TNM staging, and survival outcome. We apply preprocessing steps to transform the categorical features and to remove features that are either redundant or deemed target leakage. After the features are preprocessed, we create a feature group in SageMaker Feature Store and ingest the features into it. The feature store serves as a central repository for features from all three modalities.

## Step 1: Read in the SageMaker JumpStart Solution configuration

In [None]:
import json

# Use a context manager so the file handle is closed after reading
with open("stack_outputs.json") as f:
    SOLUTION_CONFIG = json.load(f)
SOLUTION_BUCKET = SOLUTION_CONFIG["SolutionS3Bucket"]
REGION = SOLUTION_CONFIG["AWSRegion"]
SOLUTION_PREFIX = SOLUTION_CONFIG["SolutionPrefix"]
SOLUTION_NAME = SOLUTION_CONFIG["SolutionName"]
BUCKET = SOLUTION_CONFIG["S3Bucket"]
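
If a key is missing from `stack_outputs.json`, the lookups above fail with a bare `KeyError` partway through the cell. A minimal sketch of a loader that checks the expected keys up front follows. The function name `load_solution_config` and the `REQUIRED_KEYS` tuple are illustrative and not part of the solution; the key list is simply the set of keys this notebook reads.

```python
import json

# Keys this notebook reads from the config file (taken from the cell above).
REQUIRED_KEYS = (
    "SolutionS3Bucket",
    "AWSRegion",
    "SolutionPrefix",
    "SolutionName",
    "S3Bucket",
)


def load_solution_config(path="stack_outputs.json", required_keys=REQUIRED_KEYS):
    """Hypothetical helper: load the JSON config and fail early if keys are missing."""
    with open(path) as f:
        config = json.load(f)
    missing = [key for key in required_keys if key not in config]
    if missing:
        raise KeyError(f"Missing keys in {path}: {missing}")
    return config
```

With this helper, `SOLUTION_CONFIG = load_solution_config()` would replace the direct `json.load` call.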

## Step 2: Download and read in the clinical dataset

#### Download the input data from S3

In [None]:
import pandas as pd

file_name = "NSCLCR01Radiogenomic_DATA_LABELS_2018-05-22_1500-shifted.csv"

input_data_bucket = f"s3://{SOLUTION_BUCKET}-{REGION}/{SOLUTION_NAME}/data"
input_data = f"{input_data_bucket}/{file_name}"
!aws s3 cp $input_data .

data_clinical = pd.read_csv(file_name)

## Step 3: Data Preprocessing

We focus our analysis on cases whose `Case ID` has the prefix `R01-*` because they have corresponding medical imaging and genomic data. For feature preprocessing, we do the following:
1. Remove the imaging date features `CT Date` and `PET Date`, as they are irrelevant to the modeling.
2. Remove recurrence- and survival-related date features such as `Date of Recurrence` and `Date of Death`, as they carry target leakage.
3. Transform categorical features with one-hot encoding.
4. Drop samples whose `Weight (lbs)` or `Pack Years` value is `Not Collected`.
5. Fill `NaN` with 0.
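
To make steps 3 to 5 concrete, here is a small sketch on a toy DataFrame. The values are made up for illustration; only the column names mirror the clinical dataset. The `Pack Years` column is given an explicit `object` dtype because it mixes strings and missing values, as in the real extract.

```python
import pandas as pd

# Toy data with made-up values; only the column names mirror the clinical dataset.
toy = pd.DataFrame(
    {
        "Case ID": ["R01-001", "R01-002", "R01-003"],
        "Gender": ["Male", "Female", "Male"],
        "Pack Years": pd.Series(["30", "Not Collected", None], dtype=object),
    }
)

# Step 3: one-hot encode the categorical column.
toy_enc = pd.get_dummies(toy[["Gender"]])

# Recombine with the columns that are not encoded.
toy_all = pd.concat([toy_enc, toy[["Case ID", "Pack Years"]]], axis=1)

# Step 4: drop samples whose value was not collected.
toy_all = toy_all[toy_all["Pack Years"] != "Not Collected"]

# Step 5: fill remaining missing values with 0.
# fillna returns a new frame, so the result must be assigned back.
toy_all = toy_all.fillna(0)

print(toy_all)
```

Note that filtering rows in step 4 leaves gaps in the index (here the labels 0 and 2 remain), which matters for any later assignment that aligns on index labels.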

In [None]:
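# Keep samples with Case IDs starting with "R01-", as these have corresponding medical imaging data. Drop samples with Case IDs "AMC-*".
# Take a copy so that the in-place drop below operates on an independent frame rather than a slice.
data_clinical = data_clinical[~data_clinical["Case ID"].str.contains("AMC")].copy()

# Delete date-related columns that are irrelevant to modeling or leak the target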
list_delete_cols = ['Quit Smoking Year', 'Date of Recurrence', 'Date of Last Known Alive', 'Date of Death', 'CT Date', 'PET Date']
data_clinical.drop(list_delete_cols, axis=1, inplace=True)

# List of features with categorical values
list_encode_cols = ["Patient affiliation", "Gender", "Ethnicity", "Smoking status", "%GG", "Tumor Location (choice=RUL)", "Tumor Location (choice=RML)", "Tumor Location (choice=RLL)", "Tumor Location (choice=LUL)", "Tumor Location (choice=LLL)", "Tumor Location (choice=L Lingula)", "Tumor Location (choice=Unknown)", "Histology ", "Pathological T stage", "Pathological N stage", "Pathological M stage", "Histopathological Grade", "Lymphovascular invasion", "Pleural invasion (elastic, visceral, or parietal)", "EGFR mutation status", "KRAS mutation status", "ALK translocation status", "Adjuvant Treatment", "Chemotherapy", "Radiation", "Recurrence", "Recurrence Location"]

# List of features kept without encoding (record identifier, numeric features, and the survival label)
list_nonenc_cols = ["Case ID", "Age at Histological Diagnosis", "Weight (lbs)", "Pack Years", "Time to Death (days)", "Days between CT and surgery", "Survival Status"]

# One-hot encoding of features with categorical value
data_clinical_enc = pd.get_dummies(data_clinical[list_encode_cols])

data_clinical_nonenc = data_clinical[list_nonenc_cols]

# Combine all features
data_clin = pd.concat([data_clinical_enc, data_clinical_nonenc], axis=1)

# Feature names inside FeatureStore should not have special chars and should be < 64 chars long
# Update feature names accordingly

l_char = ['-',' ','%','/','<','>','(',')','=',',',':']

for col in (data_clin.columns):

    if (col == "Case ID"):
        data_clin.rename(columns={col: col.replace(' ','_')}, inplace = True)
        continue

    for char in l_char:
        if char in col:
            data_clin.rename(columns={col: col.replace(char,'')}, inplace = True)
            col = col.replace(char,'')
            
    if (len(col)>=64):
        data_clin.rename(columns={col: col[:60]}, inplace = True)
        
# Change label (survival status) "Dead"=1 and "Alive"=0
data_clin["SurvivalStatus"] = data_clin["SurvivalStatus"].replace({"Dead": "1", "Alive": "0"})


# Drop samples whose weight or pack-years value was not collected.
# Fill NaN with 0. E.g., PackYears for non-smokers is "NA"; change it to 0.
data_clin = data_clin[data_clin['Weightlbs'] != "Not Collected"]
data_clin = data_clin[data_clin['PackYears'] != "Not Collected"]
# fillna returns a new DataFrame, so assign the result back
data_clin = data_clin.fillna(0)
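
The renaming loop above strips a fixed list of characters from each column name and truncates long names. The same rule can be sketched as a standalone function, which is easier to check in isolation. The name `sanitize_feature_name` and its parameters are hypothetical; the function mirrors this notebook's character list and its truncation to 60 characters, and is not an authoritative statement of the Feature Store naming rules.

```python
# Characters removed by the notebook's renaming loop.
SPECIAL_CHARS = ['-', ' ', '%', '/', '<', '>', '(', ')', '=', ',', ':']


def sanitize_feature_name(name, max_len=64, truncate_to=60):
    """Hypothetical helper mirroring the renaming loop for a single column name."""
    if name == "Case ID":
        # The record identifier gets an underscore instead of losing the space.
        return name.replace(' ', '_')
    for char in SPECIAL_CHARS:
        name = name.replace(char, '')
    if len(name) >= max_len:
        name = name[:truncate_to]
    return name


print(sanitize_feature_name("Weight (lbs)"))  # Weightlbs
print(sanitize_feature_name("Case ID"))       # Case_ID
```

Applied to a frame, this would read `data_clin = data_clin.rename(columns=sanitize_feature_name)`. Because truncation can in principle map two long names to the same result, it is worth checking `data_clin.columns.is_unique` afterwards.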

## Step 4: Create SageMaker FeatureStore

First, we cast the object dtype to string, which then maps to the String feature type in SageMaker FeatureStore. We then add an event time column (named by `event_time_feature_name`) and designate the existing `Case_ID` column as the record identifier (named by `record_identifier_feature_name`). Both names are required to create the feature group.
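
As a small illustration of the dtype cast and the event time column, consider the toy frame below (made-up values, with a non-contiguous index such as the one left behind by row filtering). It also shows a pandas detail worth keeping in mind: assigning a new `pd.Series` aligns on index labels, whereas assigning a scalar broadcasts to every row.

```python
import time

import pandas as pd

# Toy frame with made-up values and a non-contiguous index.
toy = pd.DataFrame(
    {
        "Case_ID": pd.Series(["R01-001", "R01-002"], dtype=object),
        "Age": [65, 71],
    }
)
toy.index = [5, 9]

# Cast object columns to pandas' string dtype.
for label in toy.columns:
    if toy[label].dtype == "object":
        toy[label] = toy[label].astype("str").astype("string")

# A scalar broadcasts to every row regardless of the index labels.
toy["EventTime"] = float(int(round(time.time())))

# A fresh Series has index 0, 1, which matches neither label 5 nor 9,
# so every row of this column ends up as NaN.
toy["Misaligned"] = pd.Series([1.0] * len(toy))

print(toy.dtypes)
print(toy)
```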

In [None]:
import time

def cast_object_to_string(data_frame):
    """Cast every object-dtype column to pandas' string dtype, in place."""
    for label in data_frame.columns:
        print(label)
        if data_frame.dtypes[label] == 'object':
            data_frame[label] = data_frame[label].astype("str").astype("string")
            
current_time_sec = int(round(time.time()))

# Cast object dtype to string. SageMaker FeatureStore Python SDK will then map the string dtype to String feature type.
cast_object_to_string(data_clin)

# Record identifier and event time feature names
record_identifier_feature_name = "Case_ID"
event_time_feature_name = "EventTime"

# Append EventTime feature. Assign a scalar so the value broadcasts to every row.
# A new pd.Series would align on index labels, which are no longer contiguous after
# the row filtering above, and would leave NaN in the rows whose labels do not match.
data_clin[event_time_feature_name] = float(current_time_sec)

Next, we define the FeatureGroup and load the feature definitions into it.

In [None]:
import boto3
import sagemaker
from sagemaker.session import Session
from sagemaker.feature_store.feature_group import FeatureGroup


boto_session = boto3.Session(region_name=REGION)
sagemaker_client = boto_session.client(service_name="sagemaker", region_name=REGION)
featurestore_runtime = boto_session.client(service_name="sagemaker-featurestore-runtime", region_name=REGION)

feature_store_session = Session(
    boto_session=boto_session,
    sagemaker_client=sagemaker_client,
    sagemaker_featurestore_runtime_client=featurestore_runtime
)

clinical_feature_group_name = f"{SOLUTION_PREFIX}-clinical-feature-group"
%store clinical_feature_group_name

clinical_feature_group = FeatureGroup(name=clinical_feature_group_name, sagemaker_session=feature_store_session)

# Load feature definitions to the feature group. SageMaker FeatureStore Python SDK will auto-detect the data schema based on input data.
clinical_feature_group.load_feature_definitions(data_frame=data_clin) # output is suppressed

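We create the FeatureGroup for the clinical dataset with both the online and offline stores enabled. The `s3_uri` argument sets the offline store location, and `enable_online_store=True` turns on the online store.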

In [None]:
%%time

def wait_for_feature_group_creation_complete(feature_group):
    status = feature_group.describe().get("FeatureGroupStatus")
    while status == "Creating":
        print("Waiting for Feature Group Creation")
        time.sleep(5)
        status = feature_group.describe().get("FeatureGroupStatus")
    if status != "Created":
        raise RuntimeError(f"Failed to create feature group {feature_group.name}")
    print(f"FeatureGroup {feature_group.name} successfully created.")
    
prefix = "clinical"

clinical_feature_group.create(
    s3_uri=f"s3://{BUCKET}/{prefix}",
    record_identifier_name=record_identifier_feature_name,
    event_time_feature_name=event_time_feature_name,
    role_arn=sagemaker.get_execution_role(),
    enable_online_store=True
)

wait_for_feature_group_creation_complete(feature_group=clinical_feature_group)
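
The polling loop above waits for as long as the status stays `Creating`. A variant with an upper bound on the wait can be sketched as follows. The helper `wait_for_status` and its parameters are hypothetical; it only assumes, as the cell above does, that `describe()` returns a dict with a `FeatureGroupStatus` entry.

```python
import time


def wait_for_status(describe, pending="Creating", done="Created",
                    timeout_sec=600, poll_sec=5):
    """Hypothetical helper: poll describe() until the status leaves `pending`.

    Raises TimeoutError if the status is still `pending` at the deadline and
    RuntimeError if it settles on anything other than `done`.
    """
    deadline = time.monotonic() + timeout_sec
    status = describe().get("FeatureGroupStatus")
    while status == pending:
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Status still {pending!r} after {timeout_sec} seconds")
        time.sleep(poll_sec)
        status = describe().get("FeatureGroupStatus")
    if status != done:
        raise RuntimeError(f"Unexpected status: {status!r}")
    return status
```

In this notebook it would be called as `wait_for_status(clinical_feature_group.describe)`. Passing the bound method rather than the feature group keeps the helper independent of the SDK and easy to exercise with a stub.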

After the feature group is created, we can ingest the clinical dataset into it.

In [None]:
clinical_feature_group.ingest(
    data_frame=data_clin, max_workers=3, wait=True
)
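
Once ingestion has finished, one way to spot-check the result is to read a single record back from the online store with the `get_record` call of the `sagemaker-featurestore-runtime` client. The sketch below wraps that call in a hypothetical helper, `fetch_record`, which flattens the response into a plain dict. It takes the client as an argument so that it can be tried with a stub before it is pointed at AWS.

```python
def fetch_record(runtime_client, feature_group_name, record_id):
    """Hypothetical helper: read one record from the online store as a dict."""
    response = runtime_client.get_record(
        FeatureGroupName=feature_group_name,
        RecordIdentifierValueAsString=record_id,
    )
    # Use .get so that a response without a "Record" entry yields an empty dict.
    return {
        item["FeatureName"]: item["ValueAsString"]
        for item in response.get("Record", [])
    }
```

In this notebook the call would look like `fetch_record(featurestore_runtime, clinical_feature_group_name, "R01-001")`, where `"R01-001"` is a placeholder for any `Case_ID` value present in `data_clin`.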

## Next Stage

Next, we'll take a look at preprocessing the medical imaging data.

Click here to [continue](./3_preprocess_imaging_data.ipynb).