Login

In [1]:
#pip install synapseclient
import synapseclient
import os
import numpy as np
syn = synapseclient.Synapse()
syn.login('sungmk86', 'deepbio')

Welcome, sungmk86!


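Select the data set
* Select the sample lists for the GBM training set and test set, together with the miRNA data and survival data. (This note refers to GBM; the IDs in the cell below are the KIRC ones.)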

In [2]:
ACRONYM = 'KIRC'
trainLabelsId = "syn1714093"   # Training bootstraps for KIRC
testLabelsId = "syn1714090"    # Testing bootstraps for KIRC
dataId = "syn1710291"          # for miRNA KIRC data
survivalDataId = 'syn1710303'


def readFile(entity, strip=None):
    with open(os.path.join(entity['cacheDir'], entity['files'][0])) as f:
        data = np.asarray([l.strip(strip).split('\t') for l in f])
    return data

def match(seq1, seq2):
    """Finds the index locations of seq1 in seq2"""
    return [ np.nonzero(seq2==x)[0][0] for x in seq1  if x in seq2 ]
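
To make the behavior of `match` concrete, here is a small self-contained sketch. The sample IDs are hypothetical and not taken from the Synapse files. Note that IDs missing from `seq2` are silently skipped, so the result can be shorter than `seq1`.

```python
import numpy as np

def match(seq1, seq2):
    """Finds the index locations of seq1 in seq2"""
    return [np.nonzero(seq2 == x)[0][0] for x in seq1 if x in seq2]

# Hypothetical sample IDs, for illustration only
samples = np.asarray(['S-01', 'S-02', 'S-03', 'S-04'])
labels = np.asarray(['S-03', 'S-01', 'S-03', 'S-99'])  # 'S-99' is not in samples

idx = match(labels, samples)
print(idx)            # position in samples of each label that was found
print(samples[idx])   # the matched IDs, in label order (repeats preserved)
```

Because unmatched IDs are dropped rather than reported, the later `assert` comparing labels with `samples[idx]` is what catches a mismatch.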

### Download data from Synapse
Download the data using the readFile function defined above.

 1 sample list
 
 Lists like the one below are stored as a (samples x 100) matrix. Each of the 100 columns is one subsampling set drawn with replacement.
 
> TCGA-41-2571	TCGA-32-4719	TCGA-27-2519	TCGA-06-1084	TCGA-27-2521 ...
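
As an illustration of what "drawn with replacement" means for such a label matrix, the sketch below builds a small matrix of resampled IDs with one column per subsampling set. The sample IDs, matrix size, and seed are assumptions made for the example; how the Synapse bootstrap files were actually generated is not shown in this notebook.

```python
import numpy as np

rng = np.random.RandomState(0)   # fixed seed so the sketch is repeatable
sample_ids = np.asarray(['S-01', 'S-02', 'S-03', 'S-04', 'S-05'])  # hypothetical IDs
n_draws, n_bootstraps = 4, 3     # small sizes for illustration; the real files hold 100 columns

# Each column is one resampled set; the same ID may appear more than once
boot = np.column_stack([
    sample_ids[rng.randint(0, len(sample_ids), size=n_draws)]
    for _ in range(n_bootstraps)
])
print(boot.shape)
```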

In [3]:
#Download bootstrap labels
testLabels = readFile(syn.get(testLabelsId))
trainLabels = readFile(syn.get(trainLabelsId))

 2 miRNA data 
 
  - first row: feature names (1045 features)
 
  - first column: sample IDs (243 samples)
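
The layout described above (feature names in the first row, sample IDs in the first column) can be split as in this minimal sketch. The tiny tab-delimited table and its names are made up for illustration; only the slicing pattern mirrors the notebook.

```python
import numpy as np

# Hypothetical tab-delimited table: header row of feature names, one row per sample
text = "id\tmir-1\tmir-2\tmir-3\nS-01\t0.5\t1.0\t2.0\nS-02\t1.5\t0.0\t3.0\n"
table = np.asarray([line.strip().split('\t') for line in text.splitlines()])

features = table[0, 1:]                  # feature names from the first row
samples = table[1:, 0]                   # sample IDs from the first column
values = table[1:, 1:].astype(float).T   # numeric block, transposed to features x samples
print(values.shape)
```

After the transpose, columns correspond to samples, which is why the later cells select a bootstrap with `data[:, trainIdx]`.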


In [4]:
#Download specific data
data = readFile(syn.get(dataId))
features=data[0,1:]
samples=data[1:,0]
data=data[1:,1:].astype(float).T  # np.float was removed from NumPy; the builtin float is equivalent
print(data.shape)

(1045, 243)


 3 survival data 
 
  - first column: 210 samples
  - second column: OS_OS (survival time)
  - third column: OS_vital_status (0 or 1; 1 = death)

In [5]:
#Download and extract the survival data
survival=readFile(syn.get(survivalDataId), '\n')
survTime = survival[1:,1].astype(int)    # np.int was removed from NumPy; the builtin int is equivalent
survStatus = survival[1:,2].astype(int)

Loading packages

Old code

In [10]:
#%load_ext rmagic
#%R require(survival); require(randomSurvivalForest); require(survcomp)

# Code revision 1

In [7]:
%load_ext rpy2.ipython
%R require(survival); require(randomForestSRC); require(survcomp) 

The rpy2.ipython extension is already loaded. To reload it, use:
  %reload_ext rpy2.ipython


array([1], dtype=int32)

 Train 100 models (trainLabels.shape[1] == 100)

In [8]:
predictions=[]
bootstrapIdx=1  # subsampling ID
# Extract the training and testing sets of one bootstrap
trainIdx = match(trainLabels[:,bootstrapIdx], samples)
testIdx = match(testLabels[:,bootstrapIdx], samples)

#Verify that the labels are the same
assert (np.all(trainLabels[:,bootstrapIdx]==samples[trainIdx]) and 
        np.all(testLabels[:,bootstrapIdx]==samples[testIdx]))

# Extract training and testing sets
trainData = data[:, trainIdx].T
trainSurvStatus = survStatus[trainIdx]
trainSurvTime = survTime[trainIdx]
testData = data[:, testIdx].T
testSurvStatus = survStatus[testIdx]
testSurvTime = survTime[testIdx]

concordance.index: a function that computes the concordance index (C-index) for survival or binary class prediction.

It takes the risk prediction, event time, and event occurrence as arguments.

- risk prediction = estimate of mortality = [ensemble mortality](https://projecteuclid.org/download/pdfview_1/euclid.aoas/1223908043)
    - The ensemble mortality for individual i is the sum of that individual's (xi) ensemble CHF values over all time points (Tj).
- CHF: cumulative hazard function
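
The two quantities above can be sketched in plain Python. This is a simplified illustration, not the `survcomp` implementation: `ensemble_mortality` and `c_index` are hypothetical helper names, the CHF matrix and survival values are made up, and the pair-counting rule is a basic Harrell-style one (a pair is usable when the subject with the shorter time had an event; ties in risk count as one half).

```python
def ensemble_mortality(chf):
    """Sum each subject's ensemble CHF over all time points (one row per subject)."""
    return [sum(row) for row in chf]

def c_index(risk, time, status):
    """Simplified Harrell-style concordance index (illustrative only)."""
    concordant, usable = 0.0, 0
    n = len(risk)
    for i in range(n):
        for j in range(n):
            # Usable pair: subject i had the event strictly before time[j]
            if status[i] == 1 and time[i] < time[j]:
                usable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0      # higher risk, earlier event
                elif risk[i] == risk[j]:
                    concordant += 0.5      # tied risk counts as half
    return concordant / usable

# Made-up CHF values: 4 subjects x 3 time points
chf = [[0.1, 0.3, 0.6],
       [0.0, 0.1, 0.2],
       [0.2, 0.5, 0.9],
       [0.1, 0.2, 0.4]]
risk = ensemble_mortality(chf)
time = [5, 12, 9, 3]       # made-up survival times
status = [1, 0, 1, 1]      # 1 = death, 0 = censored
print(c_index(risk, time, status))
```

A value near 1 means that subjects with higher predicted mortality tend to die earlier, while 0.5 corresponds to random ordering. The row-wise sum in `ensemble_mortality` plays the same role as `apply(..., 1, sum)` on `$chf` in the R call.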

Previous code

In [None]:
#Push to R, model and predict
# %Rpush trainData trainSurvStatus trainSurvTime testData testSurvStatus testSurvTime
# %R rsf.model.fit <- rsf(Surv(time,status) ~ ., data=data.frame(time=trainSurvTime,status=trainSurvStatus, trainData), ntree=1000, na.action="na.impute", splitrule="logrank", nsplit=1, importance="randomsplit", seed=-1)
# %R -o predictedResponse predictedResponse <- predict(rsf.model.fit, data.frame(testData), na.action="na.impute")$mortality

# Code revision 2

In [9]:
# Push to R, model and predict
%Rpush trainData trainSurvStatus trainSurvTime testData testSurvStatus testSurvTime
%R rsf.model.fit <- rfsrc(Surv(time,status) ~ ., data=data.frame(time=trainSurvTime,status=trainSurvStatus, trainData), ntree=1000, na.action="na.impute", splitrule="logrank", nsplit=1, importance="random", seed=-1)
%R -o predictedResponse predictedResponse <- apply(predict(rsf.model.fit, data.frame(testData), na.action="na.impute")$chf,1,sum)

array([ 14.9245    ,   8.29401667,  23.6708    ,  23.71368333,
        16.51455   ,  30.88895   ,  18.84501667,  18.11331667,
        26.76788333,  11.60828333,  14.9641    ,  11.71976667,
        11.64951667,  12.62608333,   9.57088333,  15.1263    ,
        22.25023333,  21.70128333,  26.78588333,  24.41805   ,
        16.14308333,  14.61138333,  15.76896667,  31.54161667,
        28.09238333,   8.87148333,  30.99628333,  14.01831667,
        12.76513333,  52.56321667,  19.85733333,  18.35186667,
        19.70491667,   7.67951667,  17.37926667,  41.37675   ,
        28.2649    ,  16.1784    ,  12.00416667,  18.8883    ,
        29.35845   ,  11.36613333,  25.911     ,  39.11815   ,
        28.9217    ,  29.3444    ,  18.80491667,  27.6221    ])

concordance.index result

In [10]:
#TODO replace this with creating the matrix of results
%R -o concordance concordance <- concordance.index(predictedResponse, testSurvTime, testSurvStatus)$c.index
print(concordance)

[ 0.67652174]


In [None]:
%load_ext rpy2.ipython
%R require(survival); require(randomForestSRC); require(survcomp)

predictions=[]
Concordance=[]
for bootstrapIdx in range(trainLabels.shape[1]):
    # Extract the training and testing sets of one bootstrap
    trainIdx = match(trainLabels[:,bootstrapIdx], samples)
    testIdx = match(testLabels[:,bootstrapIdx], samples)

    #Verify that the labels are the same
    assert (np.all(trainLabels[:,bootstrapIdx]==samples[trainIdx]) and
            np.all(testLabels[:,bootstrapIdx]==samples[testIdx]))

    # Extract training and testing sets
    trainData = data[:, trainIdx].T
    trainSurvStatus = survStatus[trainIdx]
    trainSurvTime = survTime[trainIdx]
    testData = data[:, testIdx].T
    testSurvStatus = survStatus[testIdx]
    testSurvTime = survTime[testIdx]


    #Push to R, model and predict
    %Rpush trainData trainSurvStatus trainSurvTime testData testSurvStatus testSurvTime
    %R rsf.model.fit <- rfsrc(Surv(time,status) ~ ., data=data.frame(time=trainSurvTime,status=trainSurvStatus, trainData), ntree=1000, na.action="na.impute", splitrule="logrank", nsplit=1, importance="random", seed=-1)
    %R -o predictedResponse predictedResponse <- apply(predict(rsf.model.fit, data.frame(testData), na.action="na.impute")$chf,1,sum)
    #TODO replace this with creating the matrix of results
    %R -o concordance concordance <- concordance.index(predictedResponse, testSurvTime, testSurvStatus)$c.index
    print(concordance)
    predictions.append(predictedResponse)
    Concordance.append(concordance)
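
Once the loop finishes, the per-bootstrap scores can be summarized. The sketch below assumes each entry of `Concordance` is a length-1 array, as in the single-bootstrap output shown earlier; the three values here are placeholders, not results from this analysis.

```python
import numpy as np

# Placeholder values standing in for the per-bootstrap C-index arrays
Concordance = [np.array([0.61]), np.array([0.70]), np.array([0.66])]

scores = np.concatenate(Concordance)   # flatten to one score per bootstrap
print(scores.mean())                   # average C-index across bootstraps
print(scores.std(ddof=1))              # sample standard deviation across bootstraps
```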