# Create TCGA Failed Sample Data Sets
Several submissions failed, typically because Bam2Fastq ran out of memory. Use the sample.tsv file to find the failed samples and create a new sample set so that we can re-run them without accidentally causing needless recomputation of successful samples.

Note: changing the WDL causes everything to recompute.

- ref:
- [Make a set of data table from scratch](https://support.terra.bio/hc/en-us/articles/360047611871#h_01EJXZMM6GA3481YRRQBR65MY3)

- [Data Tables QuickStart Part 3: Understanding sets of data](https://support.terra.bio/hc/en-us/articles/360047611871)

- [Adding data to a workspace with a template](https://support.terra.bio/hc/en-us/articles/360059242671). See "Sets of data - sample_set table".

- createFailedSampleDataSet.ipynb
    The main difference is that the GTEx data is in a single workspace, while the TCGA data is spread across multiple workspaces.

# expected set format
- ID (first column) 
    The first column is the ID column, the unique name of the set. The format is '<span style="color:red">membership:</span>your-entity-name<span style="color:red">_set_id</span>'. The parts in red are required exactly as typed. The entity-name is the name of whatever entity you are grouping together. For the Data-QuickStart it will be specimens, but if your workflow processes samples, it could be samples.


- Entity column
    The second column is the entity you're grouping into sets. The header must match the first column header of the table your workflow will take its single inputs from.

Given a sample.tsv file, our sample_set file is:
```
(base) $ cat sample_set_membership.tsv 
membership:sample_set_id	sample
set-d-4-panc	GTEX-11GSP-0426-SM-5A5KX
set-d-4-panc	GTEX-11I78-0626-SM-5A5LZ
set-d-4-panc	GTEX-11LCK-0226-SM-5A5M6
set-d-4-panc	GTEX-11NSD-0526-SM-5A5LT
```
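
As a minimal sketch of the format described above, a membership table can be built with pandas and written as tab-separated text. The helper name `build_membership` is hypothetical and not part of this notebook; the set name and sample ids are copied from the listing above.

```python
import io
import pandas as pd

def build_membership(set_name, sample_ids):
    """Return a two-column membership table: set id, then member sample id."""
    return pd.DataFrame({
        "membership:sample_set_id": [set_name] * len(sample_ids),
        "sample": list(sample_ids),
    })

# set name and sample ids taken from the listing above
membership_df = build_membership(
    "set-d-4-panc",
    ["GTEX-11GSP-0426-SM-5A5KX", "GTEX-11I78-0626-SM-5A5LZ"],
)

# write tab-separated text without the pandas index column
buf = io.StringIO()
membership_df.to_csv(buf, sep="\t", index=False)
membership_tsv = buf.getvalue()
print(membership_tsv)
```

Every row repeats the set name in the first column, so a set with N members is N rows long.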

In [1]:
from datetime import datetime
now = datetime.now()
today = now.strftime('%Y-%m-%d')
currentTime = now.strftime('%H:%M:%S')

# put a time stamp in the name. This prevents a data bug if we accidentally run
# multiple times. Without the time stamp the number of samples displayed
# by terra will be a multiple of the true number of samples
print("run on {}".format( today +  " " + currentTime ))

import numpy as np
import pandas as pd

run on 2022-03-14 14:20:52


In [2]:
# set names can not contain ':'
newDataSetName = "failedPair-" + today + "-" + now.strftime('%H-%M') 
newDataSetName

'failedPair-2022-03-14-14-20'

In [3]:
# backups of terra data models are stored in a separate repo
# so that branch merges do not lose data model versions
rootDir = "../../../terraDataModels/test-aedavids-proj/TCGA"
listOfWorkSpacePath = rootDir + "/" + "listOfWorkSpaces.csv" 
workspaceDF = pd.read_csv( listOfWorkSpacePath )
workSpaceNamesList = workspaceDF.loc[:, "wokspace"].to_list()

In [4]:
def readDataModel( rootDir, workspaceName, entityName ) :
    dataModelTSV = rootDir + "/" + workspaceName + "/" + entityName + ".tsv"
    dataModelDF = pd.read_csv(dataModelTSV, delimiter='\t')
    return dataModelDF

In [5]:
def saveDataModel( rootDir, workspaceName, entityName, dataModelDF ) :
    dataModelTSV = rootDir + "/" + workspaceName + "/" + entityName + ".tsv"
    print("writing {}".format(dataModelTSV))
    dataModelDF.to_csv(dataModelTSV, sep='\t', index=False)

In [6]:
def findSamplesWithMissingQuantFile(rootDir, workspaceName, quantFileColName):
    sampleDF = readDataModel( rootDir, workspaceName, entityName = "sample" )     
    print("\nworkspace: {}".format(workspaceName))
    
    mRNARowsLogicalPS = sampleDF['mRNASeq_fastq_path'].notna()
    nunMRNAFiles = sum( mRNARowsLogicalPS )
    print("number of mRNASeq_fastq_path files: {}".format(nunMRNAFiles))
    mRNA_DF = sampleDF.loc[mRNARowsLogicalPS,:]

    # find rows that have fastq files but are missing results. ie 'quantFilePaired' value
    passedSamplesLogicalPS = mRNA_DF[quantFileColName].notna()
    numPassed = sum(passedSamplesLogicalPS) 
    
    passedSamplesDF = mRNA_DF.loc[passedSamplesLogicalPS,:]
    
    failedSampleLogicalPS =  mRNA_DF[quantFileColName].isna()
    failedSampleDF = mRNA_DF.loc[failedSampleLogicalPS,:]
    
    numFailed = nunMRNAFiles - numPassed
   
    print("num passed:{}".format(numPassed))
    print("num failed:{}".format(numFailed))
    
    return failedSampleDF
    

def testFindSamplesWithMissingQuantFile():

    # 92 subjects in data set
    # last job, 184 workflows 79 passed, 105 failed
    #
#     workspaceName = 'TCGA_ACC_ControlledAccess_V1-0_DATA_edu_ucsc_kim_lab'
#     quantColName  = "quantFilePaired3"

    #
#     workspaceName = "TCGA_PAAD_ControlledAccess_V1-0_DATA_edu_ucsc_kim_lab"
#     quantColName  = "quantFilePaired3"

    workspaceName = "TCGA_PCPG_ControlledAccess_V1-0_DATA_edu_ucsc_kim_lab"
    quantColName = "quantFile3"
    
    failedSamplesDF = findSamplesWithMissingQuantFile(rootDir, workspaceName, quantColName)
    print("failedSamplesDF.shape:{}".format(failedSamplesDF.shape))
    #print(failedSamplesDF)
    
testFindSamplesWithMissingQuantFile()


workspace: TCGA_PCPG_ControlledAccess_V1-0_DATA_edu_ucsc_kim_lab
number of mRNASeq_fastq_path files: 187
num passed:185
num failed:2
failedSamplesDF.shape:(2, 38)


In [7]:
def createFailedDataSet( rootDir, workspaceName, failedSamplesDF, newDataSetName ):
    sampleSetMembershipDF = readDataModel( rootDir, workspaceName, entityName = "sample_set_membership" ) 
    
    ds = [newDataSetName] * failedSamplesDF.shape[0]
    # work on a copy to avoid pandas' 'SettingWithCopyWarning'
    failedSamplesDF = failedSamplesDF.copy()
    failedSamplesDF.loc[:, "membership:sample_set_id"] = ds
    
    newMembershipDF = failedSamplesDF.loc[:,['membership:sample_set_id', 'entity:sample_id']]
    # the membership table names its second column 'sample'
    newMembershipDF = newMembershipDF.rename( columns={'entity:sample_id':'sample'} )
    # DataFrame.append() was removed in pandas 2.0; pd.concat() is the replacement
    newMembershipDF = pd.concat( [sampleSetMembershipDF, newMembershipDF] )
    
    return newMembershipDF

In [8]:
def insertNewDataSetName(rootDir, workspaceName, newDataSetName) :
    sampleSetEntityDF = readDataModel( rootDir, workspaceName, entityName = "sample_set_entity" )

    newSetDF = pd.DataFrame( {"entity:sample_set_id":[newDataSetName] })
    # DataFrame.append() was removed in pandas 2.0; pd.concat() is the replacement
    newSampleSetEntityDF = pd.concat( [sampleSetEntityDF, newSetDF] )

    return newSampleSetEntityDF    


In [9]:
def run( rootDir, workspaceName, newDataSetName, quantFileColName ) :
    failedSamplesDF = findSamplesWithMissingQuantFile(rootDir, workspaceName, quantFileColName)
    if failedSamplesDF.empty :  # or (failedSampleDF.shape[0] == 0)
        return
    
    newMembershipDF      = createFailedDataSet( rootDir, workspaceName, failedSamplesDF, newDataSetName )
    newSampleSetEntityDF = insertNewDataSetName(rootDir, workspaceName, newDataSetName)
    
    saveDataModel(rootDir, workspaceName, "sample_set_entity",     newSampleSetEntityDF)
    saveDataModel(rootDir, workspaceName, "sample_set_membership", newMembershipDF)
    
    numFailed = failedSamplesDF.shape[0]
    return (workspaceName, numFailed)


## figure out which column holds the results generated by salmonTarQuantWorkflow v 4
This workflow uses the salmon paired read bug fix. The reason the column name is not always the same
is that there are 33 different workspaces, and I had to name the output column manually for each run. For an unknown reason
Terra would not let me define these values in JSON.
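
Because the column name varies by workspace, the lookup can fail when no matching column exists. The following is a minimal sketch, not the notebook's own function: `pick_quant_column` is a hypothetical helper that returns `None` instead of raising when nothing matches, and the toy column names only mimic the two variants seen in the workspaces.

```python
import pandas as pd

def pick_quant_column(columns, prefix="quantF", tag="3"):
    """Return the first column name containing both prefix and tag, or None."""
    matches = [c for c in columns if prefix in c and tag in c]
    return matches[0] if matches else None

# toy tables with no rows; only the column names matter here
toy_paired  = pd.DataFrame(columns=["entity:sample_id", "quantFile", "quantFilePaired3"])
toy_single  = pd.DataFrame(columns=["entity:sample_id", "quantFile", "quantFile3"])
toy_missing = pd.DataFrame(columns=["entity:sample_id", "quantFile"])

paired_col  = pick_quant_column(toy_paired.columns)
single_col  = pick_quant_column(toy_single.columns)
missing_col = pick_quant_column(toy_missing.columns)
print(paired_col, single_col, missing_col)
```

A caller can then skip a workspace when the result is `None` rather than maintaining a hand-written exclusion list.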


In [10]:
def findQuantFileColumnName() :
    dataDict = dict()

    missingList = ['TCGA_DLBC_ControlledAccess_V1-0_DATA_edu_ucsc_kim_lab'
                   ,'TCGA_GBM_ControlledAccess_V1-0_DATA_edu_ucsc_kim_lab'
                   ,'TCGA_LAML_ControlledAccess_V1-0_DATA_edu_ucsc_kim_lab'
                   ,'TCGA_SARC_ControlledAccess_V1-0_DATA_edu_ucsc_kim_lab'
                  ]

    for workspaceName in workSpaceNamesList:
        if workspaceName in missingList:
            continue

        #print(workspaceName)
        df = readDataModel( rootDir, workspaceName, entityName = "sample" )
        quantMatchlist = [s for s in df.columns if "quantF" in s]
        colName = [s for s in quantMatchlist if "3" in s][0]
        #print(colName)
        #print()
        dataDict[workspaceName] = (colName, df) 
        
    return dataDict
        
dataDict = findQuantFileColumnName()        

In [11]:
for workspaceName,t in dataDict.items():
    print()
#     print(workspaceName)
    colName, df = t
    failedSampleDF = findSamplesWithMissingQuantFile(rootDir, workspaceName, colName)
    print("failedSampleDF.shape:{}".format(failedSampleDF.shape))
    print(colName)    



workspace: TCGA_ACC_ControlledAccess_V1-0_DATA_edu_ucsc_kim_lab
number of mRNASeq_fastq_path files: 79
num passed:79
num failed:0
failedSampleDF.shape:(0, 40)
quantFilePaired3


workspace: TCGA_BLCA_ControlledAccess_V1-0_DATA_edu_ucsc_kim_lab
number of mRNASeq_fastq_path files: 427
num passed:427
num failed:0
failedSampleDF.shape:(0, 56)
quantFilePaired3


workspace: TCGA_BRCA_ControlledAccess_V1-0_DATA_edu_ucsc_kim_lab
number of mRNASeq_fastq_path files: 1215
num passed:1181
num failed:34
failedSampleDF.shape:(34, 57)
quantFilePaired3


workspace: TCGA_CESC_ControlledAccess_V1-0_DATA_edu_ucsc_kim_lab
number of mRNASeq_fastq_path files: 309
num passed:309
num failed:0
failedSampleDF.shape:(0, 48)
quantFilePaired3


workspace: TCGA_CHOL_ControlledAccess_V1-0_DATA_edu_ucsc_kim_lab
number of mRNASeq_fastq_path files: 45
num passed:45
num failed:0
failedSampleDF.shape:(0, 40)
quantFilePaired3


workspace: TCGA_COAD_ControlledAccess_V1-0_DATA_edu_ucsc_kim_lab
number of mRNASeq_fastq_path 

## explore TCGA_PAAD_ControlledAccess_V1-0_DATA_edu_ucsc_kim_lab
Exploration of the ESCA, OV, and STAD data set failures revealed that the mRNASeq_fastq_path files were really tar files containing paired fastq files. These 3 data sets had 100% failures when we ran salmonSingleEndReadTask.wdl. This raises the possibility that other TCGA data sets also contain paired fastq files. As a test we ran salmonTarQuantWorkflow.wdl on 2/11. Compare the results in quantFile and quantFilePaired.
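
One compact way to compare the two result columns is a cross-tabulation of which samples have a value in each. This is only a sketch on a toy table: the sample ids and `gs://` paths below are made up for illustration, and only the column names `quantFile` and `quantFilePaired` come from the data model.

```python
import pandas as pd

# toy sample table; ids and paths are hypothetical
toy = pd.DataFrame({
    "entity:sample_id": ["s1", "s2", "s3", "s4"],
    "quantFile":        ["gs://a", None, None, "gs://d"],
    "quantFilePaired":  ["gs://a2", "gs://b2", None, None],
})

# rows: has a single-end result, columns: has a paired-end result
counts = pd.crosstab(
    toy["quantFile"].notna(),
    toy["quantFilePaired"].notna(),
    rownames=["has_quantFile"],
    colnames=["has_quantFilePaired"],
)
print(counts)
```

The cell at (False, False) counts samples that failed both workflows, and the cell at (True, False) counts candidates for truly single-end data.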

In [12]:
workspaceName = "TCGA_PAAD_ControlledAccess_V1-0_DATA_edu_ucsc_kim_lab"
# failedSamplesDF      = findSamplesWithMissingQuantFile(rootDir, workspaceName)
# failedSamplesDF

In [13]:
def explore( workspaceName ) :
    sampleDF = readDataModel( rootDir, workspaceName, entityName = "sample" ) 
    colList = ['entity:sample_id', 'sample_type','quantFile', 'quantFilePaired', 'aux_info', 'aux_infoPaired']
    #print( sampleDF.loc[:, colList].head() )
    
    missingQuantFileRows = sampleDF['quantFile'].isna()
    missingQuantFilePairedRows = sampleDF['quantFilePaired'].isna()
    hasMRnaSeqRows = sampleDF['mRNASeq_fastq_path'].notna()
    
    missingSingleAndPairedResultsDF = sampleDF.loc[ missingQuantFileRows & missingQuantFilePairedRows, colList]
    #print("missing single and paired results\n{}".format(missingSingleAndPairedResultsDF.loc[:, colList]) )
    print("missing single and paired results")
    cl = ['entity:sample_id','quantFile', 'quantFilePaired','sample_type']
    print( missingSingleAndPairedResultsDF.loc[:, cl].groupby('sample_type').count() )
    
    missingSingleOrPairedResultsDF = sampleDF.loc[ missingQuantFileRows | missingQuantFilePairedRows, colList]
    print("\n missing single or paired results")
    print( missingSingleOrPairedResultsDF.loc[:,cl].groupby('sample_type').count() )   
    
    tpRows = sampleDF['sample_type'] == "TP"
 
    # copy the slice so the column assignments below do not trigger 'SettingWithCopyWarning'
    printableDF = sampleDF.loc[(tpRows & (missingQuantFileRows | missingQuantFilePairedRows)), cl].copy()
    printableDF.loc[:,"quantFile.isna"] = printableDF.loc[:,"quantFile"].isna()
    printableDF.loc[:,"quantFilePaired.isna"] = printableDF.loc[:,"quantFilePaired"].isna()
    print("\ntype == TP and either quantFile or quantFilePaired is missing")
    print(printableDF.loc[:, ['entity:sample_id', 'sample_type', 'quantFile.isna', 'quantFilePaired.isna']])
    
    print("\n\n are these truly single end reads?")
    r = (~ printableDF.loc[:, 'quantFile.isna']) & printableDF.loc[:, 'quantFilePaired.isna'] 
    print( printableDF.loc[r, ['entity:sample_id', 'sample_type', 'quantFile.isna', 'quantFilePaired.isna']])
    
          
explore( workspaceName )

missing single and paired results
             entity:sample_id  quantFile  quantFilePaired
sample_type                                              
NB                        153          0                0
NT                         33          0                0
TP                          7          0                0

 missing single or paired results
             entity:sample_id  quantFile  quantFilePaired
sample_type                                              
NB                        153          0                0
NT                         33          0                0
TP                          8          0                1

type == TP and either quantFile or quantFilePaired is missing
    entity:sample_id sample_type  quantFile.isna  quantFilePaired.isna
98   PAAD-F2-A8YN-TP          TP            True                 False
134  PAAD-FZ-5919-TP          TP            True                  True
136  PAAD-FZ-5920-TP          TP            True                  True
138 

### explore job results
https://app.terra.bio/#workspaces/test-aedavids-proj/TCGA_PAAD_ControlledAccess_V1-0_DATA_edu_ucsc_kim_lab/job_history/922ad136-e8ad-43f5-b946-5669a5a16d04

All 4 samples (PAAD-F2-7273-TP, PAAD-HZ-7924-TP, PAAD-HZ-8519-TP, PAAD-IB-8127-TP) failed the same way:

```
Task quantify.tarToFastqTask:NA:1 failed

while running "-c grep -E -q 'OutOfMemory|Killed' /cromwell_root/stderr ; echo $? > /cromwell_root/memory_retry_rc": unexpected exit status 1 was not ignored [CheckingForMemoryRetry] Unexpected exit status 1 while running "-c grep -E -q 'OutOfMemory|Killed' /cromwell_root/stderr ; echo $? > /cromwell_root/memory_retry_rc": sh: write error: No space left on device
```

## run all workspaces
<span style="color:red">check out missingList in findQuantFileColumnName. we do not test all workspaces</span>

In [14]:
dataDict = findQuantFileColumnName()        
workspacesWithFailedSamples = list()

# for workspaceName in workSpaceNamesList :
# bugFixWS = ["TCGA_SARC_ControlledAccess_V1-0_DATA_edu_ucsc_kim_lab", 
#             "TCGA_LUAD_ControlledAccess_V1-0_DATA_edu_ucsc_kim_lab"]
# for workspaceName in bugFixWS :

for workspaceName,t in dataDict.items():
    colName, df = t
    try:
        failed_ws = run( rootDir, workspaceName, newDataSetName, colName )
        if failed_ws is not None:
            workspacesWithFailedSamples.append( failed_ws )
    except Exception as e:
        print("\n\nERROR: {} {}\n\n".format(workspaceName, e))

print("\n\n set name:{} workspaces with failed samples".format(newDataSetName))
for ws in workspacesWithFailedSamples:
    print( ws )
print("check out missingList in findQuantFileColumnName. we do not test all workspaces")

# workspaceName = "TCGA_KICH_ControlledAccess_V1-0_DATA_edu_ucsc_kim_lab"
# run( rootDir, workspaceName, newDataSetName )


workspace: TCGA_ACC_ControlledAccess_V1-0_DATA_edu_ucsc_kim_lab
number of mRNASeq_fastq_path files: 79
num passed:79
num failed:0

workspace: TCGA_BLCA_ControlledAccess_V1-0_DATA_edu_ucsc_kim_lab
number of mRNASeq_fastq_path files: 427
num passed:427
num failed:0

workspace: TCGA_BRCA_ControlledAccess_V1-0_DATA_edu_ucsc_kim_lab
number of mRNASeq_fastq_path files: 1215
num passed:1181
num failed:34
writing ../../../terraDataModels/test-aedavids-proj/TCGA/TCGA_BRCA_ControlledAccess_V1-0_DATA_edu_ucsc_kim_lab/sample_set_entity.tsv
writing ../../../terraDataModels/test-aedavids-proj/TCGA/TCGA_BRCA_ControlledAccess_V1-0_DATA_edu_ucsc_kim_lab/sample_set_membership.tsv

workspace: TCGA_CESC_ControlledAccess_V1-0_DATA_edu_ucsc_kim_lab
number of mRNASeq_fastq_path files: 309
num passed:309
num failed:0

workspace: TCGA_CHOL_ControlledAccess_V1-0_DATA_edu_ucsc_kim_lab
number of mRNASeq_fastq_path files: 45
num passed:45
num failed:0

workspace: TCGA_COAD_ControlledAccess_V1-0_DATA_edu_ucsc_kim_